COMPARISONS OF VARIMAX ROTATIONS WITH
ROTATIONS TO THEORETICAL TARGETS 


J. P. GUILFORD AND RALPH HOEPFNER
University of Southern California 


It is well known by those who factor-analyze, and by some of
those who do not, that one of the most serious difficulties is the 
rotation problem. It should be agreed that the aim of those who 
apply factor analysis for the purpose of discovering scientific constructs
in psychology should be to achieve psychologically significant
factors, which can be replicated, which fit into systematic
psychological theory, and which can be investigated meaningfully 
by other methods. Only in this way can there be general agreement 
upon factorially discovered constructs and thus the unambiguous 
communicability that science requires. 

As one important step toward this goal, Thurstone proposed his
criterion of simple structure. In doing so, Thurstone implied considerable
faith in the expectation that whatever collection of empirical
variables is reasonably used in a factor analysis, those
variables would tend to cluster when represented as vectors in 
factor space, with areas of greater density, separated by areas of 
lower density. This general principle has been rather generally ac- 
cepted, although it can be questioned whether, even when stated in 
this rough form, the principle applies to all factor-analytic data and 
is a safe guide as to where psychologically meaningful axes should 
be located (cf. Guilford and Zimmerman, 1963). 

Thurstone did not provide a rigorous mathematical model to 
describe his simple-structure concept. Others have proposed such 
models, which at least take care of certain aspects of the simple- 
structure concept, and which make possible completely objective 
rotations of factor axes found by either centroid or principal- 
component methods of extraction. Probably the most commonly 
employed orthogonal method in this category has been Kaiser's
normal-varimax method (Kaiser, 1958). One of the features of 
simple structure is that each factor should be restricted to a few 
of the empirical variables. Accordingly, Kaiser's method aims toward
maximizing the variance of the squared loadings within each column of
the rotated factor matrix. The end result should be a small number of variables
with relatively high loadings on each factor and a large number of 
zero or near-zero loadings. 
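
The criterion just described can be made concrete. What follows is a minimal Python sketch, not the computer program used by the Project: `varimax_criterion` sums, over factors, the variance of the squared loadings in each column, and `normal_varimax` rotates one pair of factors at a time, after scaling each test vector to unit length (the "normal" step), using the closed-form planar angle commonly attributed to Kaiser (1958). The function names, the sweep limit, and the tolerance are assumptions made for illustration.

```python
import math


def varimax_criterion(loadings):
    """Sum over factors of the variance of the squared loadings in each column."""
    n_tests = len(loadings)
    n_factors = len(loadings[0])
    total = 0.0
    for j in range(n_factors):
        squared = [row[j] ** 2 for row in loadings]
        mean = sum(squared) / n_tests
        total += sum((s - mean) ** 2 for s in squared) / n_tests
    return total


def normal_varimax(loadings, max_sweeps=50, tol=1e-10):
    """Orthogonal rotation by repeated planar rotations of factor pairs."""
    n_tests = len(loadings)
    n_factors = len(loadings[0])
    # "Normal" step: scale each test vector to unit length (guard zero rows).
    lengths = [math.sqrt(sum(v * v for v in row)) or 1.0 for row in loadings]
    b = [[v / h for v in row] for row, h in zip(loadings, lengths)]
    for _ in range(max_sweeps):
        largest_angle = 0.0
        for p in range(n_factors - 1):
            for q in range(p + 1, n_factors):
                u = [row[p] ** 2 - row[q] ** 2 for row in b]
                v = [2.0 * row[p] * row[q] for row in b]
                sum_u, sum_v = sum(u), sum(v)
                c_term = sum(ui * ui - vi * vi for ui, vi in zip(u, v))
                d_term = 2.0 * sum(ui * vi for ui, vi in zip(u, v))
                numerator = d_term - 2.0 * sum_u * sum_v / n_tests
                denominator = c_term - (sum_u ** 2 - sum_v ** 2) / n_tests
                # Angle that maximizes the criterion within this plane.
                angle = 0.25 * math.atan2(numerator, denominator)
                largest_angle = max(largest_angle, abs(angle))
                cos_a, sin_a = math.cos(angle), math.sin(angle)
                for row in b:
                    x, y = row[p], row[q]
                    row[p] = x * cos_a + y * sin_a
                    row[q] = -x * sin_a + y * cos_a
        if largest_angle < tol:
            break
    # Restore the original vector lengths; communalities are unchanged.
    return [[v * h for v in row] for row, h in zip(b, lengths)]
```

Because every step is an orthogonal rotation, each test keeps its communality; only the distribution of its loading across factors changes.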

In the years since the varimax method became available, the 
Aptitudes Research Project at the University of Southern California 
has routinely used a computer program for extracting principal 
components that also provides a varimax rotation of axes, after 
the number of common factors has been chosen for rotation. The 
experience of the Project has been that rarely could the varimax 
factors be interpreted with psychological meaning. We have, there- 
fore, resorted to other methods of rotation, which have yielded not 
only meaningful factors in each analysis but factors that can be 
readily replicated with moderate changes of test batteries, and that 
fit into a comprehensive theory known as the “structure of intellect” 
(Guilford, 1967). In earlier analyses, Zimmerman’s (1946) graphic 
method was employed. In the most recent analyses the Project has 
employed Cliff's (1966) analytical method, which, in our applica- 
tion of it, rotates an obtained principal-component factor matrix 
toward a theoretical target matrix. The target matrix is designed 
to represent hypothesized factors in the structure-of-intellect model.
Each rotational solution begins by setting up a target matrix, in 
which each test is given a loading on its expected dominant factor 
equal to its vector length, the square root of its communality, and 
zero loadings on all other factors. After the first rotation of axes, 
which gives a least-square fit of the obtained data to the targeted 
loadings, it is observed where modifications must be made in the 
next target matrix in order to achieve a closer fit and to main- 
tain positive manifold. A few iterations of this sort are usually 
sufficient to achieve what is considered to be an optimal fit. 
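
The first step of the targeted procedure can be sketched in a few lines. The Python below is an illustration under stated assumptions, not Cliff's (1966) algorithm itself: `build_target` gives each test a target loading equal to its vector length on its expected factor and zeros elsewhere, as described above, while `rotate_toward_target` approximates the least-squares orthogonal fit by sweeping over pairs of factors and applying the best planar rotation for each pair (proper rotations only, no reflections). The function names and the `expected_factor` list of column indices are hypothetical.

```python
import math


def build_target(loadings, expected_factor):
    """Target row: vector length on the expected factor, zeros elsewhere."""
    target = []
    for row, j in zip(loadings, expected_factor):
        length = math.sqrt(sum(v * v for v in row))  # sqrt of communality
        t = [0.0] * len(row)
        t[j] = length
        target.append(t)
    return target


def fit_error(loadings, target):
    """Sum of squared differences between loadings and target."""
    return sum((a - t) ** 2
               for row, trow in zip(loadings, target)
               for a, t in zip(row, trow))


def rotate_toward_target(loadings, target, max_sweeps=50, tol=1e-12):
    """Reduce fit_error by planar rotations of factor pairs (a sketch)."""
    b = [list(row) for row in loadings]
    n_factors = len(b[0])
    for _ in range(max_sweeps):
        largest_angle = 0.0
        for p in range(n_factors - 1):
            for q in range(p + 1, n_factors):
                same = sum(r[p] * t[p] + r[q] * t[q] for r, t in zip(b, target))
                cross = sum(r[q] * t[p] - r[p] * t[q] for r, t in zip(b, target))
                # Angle that minimizes squared error within this plane.
                angle = math.atan2(cross, same)
                largest_angle = max(largest_angle, abs(angle))
                cos_a, sin_a = math.cos(angle), math.sin(angle)
                for r in b:
                    x, y = r[p], r[q]
                    r[p] = x * cos_a + y * sin_a
                    r[q] = -x * sin_a + y * cos_a
        if largest_angle < tol:
            break
    return b
```

In the procedure described above, the rotated matrix would then be inspected, the target revised where the fit or positive manifold suffers, and the rotation repeated.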

We are well aware of the limitations that must be recognized
for conclusions that can be drawn from the application of such a
procedure.¹ Such considerations aside, it is our purpose to examine
the results obtained from applications of the two methods of rotation
to a large number of solutions in factor analyses. It was
hoped thus to learn more about the features of both methods, to
determine their habitual strengths and weaknesses, and perhaps to
come out with some ideas for future practice. Both methods give
orthogonal rotations. The issue of orthogonal versus oblique rotations
thus does not arise.

¹ The caricatured example of a “procrustian” rotation of random data
recently reported by Horn (1967) has little or no bearing on the method
used in this study, for his oblique rotations would permit much more
capitalization on chance errors, and even so, very rarely did he obtain
loadings of .30 or greater more than once for each factor. This is in
decided contrast to what we find with Cliff's orthogonal-rotational method
with real data.


Factors to Be Compared


Since the initiation of the Aptitudes Research Project in 1949,
there have been 26 independent factor analyses of tests designed for
various intellectual abilities, in which a total of 417 factors (many
of them duplicates, of course) has entered into extractions and
rotations. The total number of tests and other variables has been
929, among which were also duplications as the same test was
often used in more than one analysis. In 16 of the analyses, the
subjects were young, adult males in the range of about 18 to 25
years of age. Seven analyses utilized senior-high-school students
of both sexes, two utilized ninth-grade students, and one a class
of sixth-grade students. New extractions were made in the 26
studies with rotations of all principal-component matrices, using the
two rotational methods. Before extraction of the components, a few
changes were made in some of the test batteries, eliminating
obvious alternate forms of the same test by
summing their scores for individuals. In a few cases tests were
eliminated if there was much reason to believe that they were
sole representatives of their factors in their respective batteries,
without some likely support from other tests.

We assumed the structure-of-intellect (SI) theory to be true, and,
consequently, that the common factors to be found should represent
SI abilities. There is a multitude of empirical evidence in support
of this assumption (Guilford, 1967). At any rate, it provides a
frame of reference for making the study of the methods. One consequence
was that the number of factor axes to be rotated was
determined by the number of SI factors recognized as being represented,
even though some might be represented strongly by
single tests. For example, a vocabulary test would sometimes be a
sole marker test for SI ability CMU (verbal comprehension) or a
perceptual-speed test the only marker for SI ability EFU (evaluation
of figural units). The same number of factors was rotated for
both targeted and varimax methods.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT


Comparison Criteria 


The two methods of rotation will be compared with respect to 
a number of features, all of which have something to do with 
the common criteria of psychological meaning and identity, positive 
manifold, and simple structure. The varimax method aims toward
simple structure, which means having a small number of tests with
significant loadings on each factor with many negligible loadings;
consequently we have examined the frequency with which different
numbers of significant loadings appear on factors, where “significant”
is given the usual interpretation of +.30 or greater. Another
feature of simple structure is low factorial complexity for each test.
We have accordingly examined the complexities of tests as indicated 
by the two rotational solutions. Simple structure implies large num- 
bers of zero or near-zero loadings. We have compared the two 
methods with respect to proportions of zero loadings obtained after 
rotations in connection with selected tests that appeared more fre- 
quently in different analyses. 

Positive manifold is commonly taken for granted as a criterion 
for location of axes in meaningful positions when dealing with 
ability factors. Very small negative factor loadings (not larger
than -.10) are commonly tolerated as probable chance deviations
from zero, and even larger negative loadings may be tolerated if
there are negative correlation coefficients from which they came to
account for them. We shall see whether unreasonable negative
loadings occur by either method of rotation.

As indicated at the beginning of this paper, consistent psychological
meaning is by far the most important criterion of the success
of a factor analysis that is designed to illuminate psychological
phenomena. The simple-structure criterion was designed only as a
means to that end. Consistency applies to invariances in factors
obtained and in relations of tests to factors. We shall be concerned
about the frequency with which SI ability factors can be identified
in solutions by the two methods, and with the frequency with
which a test or an alternate form of it is significantly loaded on
the same factor.

Casual inspection of the varimax-rotated factors found in a single 
analysis commonly shows one or two very “strong” factors, strong 
in the sense that there is an unusually large number of tests 
loaded on each of them, and, on the other hand, a much larger 
number of singlet and doublet factors, each with one or two, re- 
spectively, tests loaded significantly on it. This is true even under 
what would appear to be the most favorable conditions for a suc- 
cessful varimax solution. That is, in our factor analyses it has been 
customary to develop from three to five tests for each new factor 
that is under investigation and to use two or three relatively 
unique marker tests for reference factors that have been previously 
demonstrated. There is often evidence supporting these expectations 
from pretesting studies in the form of generally higher correla- 
tions between tests designed for a new factor than between those 
tests and tests of other factors. 


Comparisons of Rotations 


In our study of the two methods of rotation, it was determined 
how frequently the factors each had one, two, three, etc. tests 
loaded on it. The results are pictured in Figure 1, with two over- 
lapping frequency distributions. The curves do not show that the 
varimax method yields a much larger number of singlet factors, 
but it does show 19 out of 417 factors with no significant loadings. 
The targeted method shows more factors each with two, three, 
four, or five tests, which is in the range where we should expect 
the factors to fall. Such cases are in line with the numbers of tests 
designed for respective factors, as just stated. Where the targeted 
method gave no factor having more than ten tests significantly 
loaded on it, varimax rotations yielded much larger numbers, even 
as many as 26 tests in the extreme case. Such long lists must nec- 
essarily include tests of diverse character, making interpretation 
of such factors next to impossible. Targeting has the virtue of break- 
ing up such composites, at least. 

Figure 1. Frequencies with which different numbers of tests with loadings
of .30 or higher occurred, with 417 factors in 26 analyses, for varimax
and targeted (Cliff's) rotational methods. [Two overlapping frequency
curves, one for varimax rotations and one for targeted rotations;
horizontal axis: Tests per Factor; vertical axis: Number of Factors.]

What percentages of the 417 factors from each of the two methods
are identifiable? We use the term “identifiable” rather than “interpretable”
here because it was assumed that each factor is interpreted
as an SI ability, whose definition is automatically given by
its location in the SI model. Rules for identifiability of an SI factor
were developed and applied rigorously to factors from both methods.
An SI factor was regarded as having been demonstrated if: 


1. There was a minimum of two SI-factor tests that have loadings
of .30 or greater on that factor. This rule eliminates
all factors having no loadings of this size, also all singlet
factors, and all doublet factors whose two tests represented
different SI abilities.

2. Even if a factor qualified under rule (1), it was rejected if
there were also loaded on the factor an equal (or greater)
number of tests representing a second SI ability, or there
were also a greater number of tests representing two or more
other SI abilities.

It should be noted that the last-mentioned specification is probably
unnecessarily strict, and all specifications ignore relative sizes
of factor loadings. For example, a factor might have on it two
tests of an SI ability with loadings near .6 and two or three
tests of another ability (or other SI abilities) with loadings of
.30 to .35, which might ordinarily be taken as sufficient indication
as to the identity of the factor, but which was not accepted under
the rule.
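
The two rules can be written out as a small decision function. This is a sketch of one reading of them, not the authors' scoring procedure: `identify_factor` is a hypothetical name, each factor is given as (loading, SI ability) pairs, and the second clause of rule 2 is interpreted as comparing the combined count of tests from two or more other abilities against the count for the leading ability.

```python
from collections import Counter


def identify_factor(tests, cut=0.30):
    """Return the SI ability a factor is credited with, or None if rejected.

    tests: list of (loading, ability) pairs for one rotated factor.
    """
    counts = Counter(ability for loading, ability in tests if loading >= cut)
    if not counts:
        return None  # no significant loadings at all
    ranked = counts.most_common()
    top_ability, top_n = ranked[0]
    if top_n < 2:
        return None  # rule 1: singlets and mixed doublets are not accepted
    others = [n for _, n in ranked[1:]]
    if others and max(others) >= top_n:
        return None  # rule 2: a second ability is represented equally or better
    if len(others) >= 2 and sum(others) > top_n:
        return None  # rule 2: several other abilities together outnumber it
    return top_ability


if __name__ == "__main__":
    # Two small factors as listed in Table 5 (K and F).
    print(identify_factor([(0.45, "CFC"), (0.44, "CFC")]))
    print(identify_factor([(0.58, "DSC"), (0.32, "DMS")]))
    # The strict case discussed above, with illustrative loadings.
    print(identify_factor([(0.60, "DMU"), (0.60, "DMU"),
                           (0.33, "DMT"), (0.31, "DMT")]))
```

As the text notes, the function looks only at counts; the relative sizes of the loadings play no part once they pass the .30 cutoff.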

Other incidental rules had to be made, applying in the varimax 
solutions only. It has sometimes happened that two different vari- 
max factors in the same analysis looked about equally good as 
representatives of the same SI ability, in which case, credit was 
given for only one identification. In other instances a varimax fac- 
tor seemed clearly to be a confounding of two SI factors, in which 
case no credit was given for an identification, for there was failure 
to differentiate them. 

There were two nonintellectual factors for which other rules ap- 
plied. In analyses for which subjects of both sexes were used a sex- 
membership variable was routinely included as a marker for a pos- 
sible sex factor. In analyses of tests that required much writing, 
a marker test for writing speed was included. In the targeted 
method, each of these variables would be targeted and would usually 
emerge as a singlet. In varimax solutions the assumed number of 
factors allowed for this same kind of outcome, but the results were 
not always of that nature. For example, a number of tests would 
have substantial loadings on a factor with the sex variable leading 
the list. One rule for accepting a sex factor or a writing-speed 
factor, then, tolerated singlets, each identified by its marker vari- 
able. Another rule called for acceptance of one of these factors if 
its marker variable had the highest loading on it. 

These rules were applied to all rotated matrices by the two authors
independently, after which a consensus was reached. The results
are easily compared. Of the total of 417 factors in each
case, only 96, or 23 per cent, of the varimax factors may be regarded
as identifiable, whereas 280, or 67 per cent, of the targeted-rotation
factors were identifiable. In view of the expected advantage
of the target method in this respect, a comparison here is not
completely fair, but the target method yields about three times as
many identifications. The very low batting average for the varimax
rotations, taken by itself, is an important finding. Less than one
fourth of all varimax factors could be regarded as identifiable
under the rules. If we had depended upon the varimax method to
arrive at a general theory of intelligence, it is doubtful whether the
SI theory, or any other theory, could have been generated from
the factor-analytical results. The consistent rejection of the varimax
rotations by the Aptitudes Research Project, therefore, seems
well justified. Even if we leave out of account the many singlet
factors, and those having no significant loadings (in the varimax
case), the percentage of “hits” by the varimax procedure is only
32 per cent. By the target approach the percentage is 93. This
comparison is meaningful because in future analyses we could virtually
avoid singlet factors when targeted rotations are made, by
giving every expected factor at least two representatives in the
selection of tests for the battery. Our earlier studies commonly involved
singlets because we did not then realize that some SI factors
were well represented by only one test each.

We are next concerned with the question of the factorial com- 
plexities of tests resulting from the two methods. The varimax 
method was not particularly designed to keep complexities low, but 
it might be expected to do so indirectly since it is aimed at simple 
structure. In order to examine this feature for each kind of analysis, 
we selected twelve tests, each used in at least five analyses, and each
well known for representing its own SI ability. We determined the
average number of factors on which each of these tests was significantly
loaded for the two kinds of rotations. The results are summarized
in Table 1. It will be seen that the targeted rotations give
rather consistently lower averages for the various tests. For all
tests combined, the weighted means are 1.23 for targeted rotations
and 1.54 for varimax rotations. There is some assurance, then, 
that where low complexity of tests is desired, the targeted rotations 
are more likely to provide it. 
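
The weighted mean reported for the varimax column can be checked directly from the entries of Table 1, weighting each test's average by its number of analyses. The short Python sketch below does so; the variable and function names are hypothetical, and the number of analyses for Numerical Operations is taken as 11, as listed in Tables 2 and 3. The targeted column is not recomputed because one of its entries is not legible in this copy.

```python
def weighted_mean(values, weights):
    """Mean of values, each weighted by its number of analyses."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Number of analyses and varimax averages, in the row order of Table 1.
N_ANALYSES = [10, 10, 10, 10, 5, 11, 6, 12, 12, 17, 20, 5]
VARIMAX_AVERAGES = [1.7, 1.8, 1.5, 1.3, 2.0, 1.4, 1.5, 1.3, 1.5, 1.4, 1.6, 2.0]

if __name__ == "__main__":
    print(sum(N_ANALYSES))  # 128 test-by-analysis entries in all
    print(round(weighted_mean(VARIMAX_AVERAGES, N_ANALYSES), 2))  # 1.54
```

The total of 128 entries is the same figure that appears later in connection with Table 3.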

TABLE 1
Factorial Complexities of Twelve Commonly Used Salient Tests in
Two Kinds of Rotations

                                        Average number of loadings
                           Number of       ≥ .30 per analysis
Test name                  analyses       Targeted      Varimax

Alternate Uses                10             1.6          1.7
Associational Fluency         10             1.5          1.8
Consequences—obvious          10             1.1          1.5
Consequences—remote           10             1.3          1.3
Ideational Fluency             5             1.4          2.0
Numerical Operations          11             1.3          1.4
Perceptual Speed               6             1.0          1.5
Plot Titles—clever            12             1.1          1.3
Plot Titles—nonclever         12             1.2          1.5
Ship Destination              17         [illegible]      1.4
Verbal Comprehension          20             1.1          1.6
Word Fluency                   5             1.2          2.0

Weighted mean                                1.23         1.54


Simple structure implies a large proportion of zero loadings in
the rotated factor matrix. Both methods in question take steps
toward this end. Counting all loadings as zero that are in the range
-.10 to +.10, the proportions of zeros were found for the same 12
tests as represented in Table 1. The results are presented in Table
2. For these tests the proportions of zeros were rather consistently
higher from the varimax rotations. The weighted means for all
the tests were .52 and .59 for the targeted and varimax methods,
respectively. The difference was not so very great. Both methods
appear to yield more zero loadings than simple structure would call
for. Very roughly, Thurstone's specification for simple structure
would call for about 45 per cent of the loadings in the analyses
in question to be zeros. We do not know how representative the 12
selected tests were.


TABLE 2
Proportions of All Loadings That Were Within ±.10 for Twelve Tests
in the Two Kinds of Rotations

                           Number of        Proportions for
Test name                  analyses       Targeted      Varimax

Alternate Uses                10         [illegible]      .48
Associational Fluency         10             .49          .49
Consequences—obvious          10             .51          .61
Consequences—remote           10             .56          .58
Ideational Fluency             5             .37      [illegible]
Numerical Operations          11             .63          .69
Perceptual Speed               6         [illegible]      .65
Plot Titles—clever            12             .50      [illegible]
Plot Titles—nonclever         12             .60          .72
Ship Destination              17             .48      [illegible]
Verbal Comprehension          20             .52          .62
Word Fluency                   5             .52      [illegible]

Weighted mean                                .52          .59


Little need be said regarding negative loadings on rotated factors,
but one or two characteristics call for comments. Experience
with the targeted method shows that much of the approach of the
obtained factor structure to the theoretical target is achieved at
the expense of some relaxation of the satisfaction of the criterion
of positive manifold. The best fits typically carry with them some
negative loadings more extreme than -.1. Many of these can be accounted
for in terms of corresponding negative correlation coefficients,
but not all. In the absence of knowledge of a standard
error of factor loadings, it is difficult to say whether or not the
negative loadings not thus accounted for can be explained in terms
of sampling errors. If we accept the findings of Cliff and Pennell
(1967) as our guide, the standard error of a zero factor loading
in our typical samples of subjects would be about .07, where N
= 200. We might therefore expect a loading of -.14 to be significant
at the .05 level and a loading of -.18 to be significant at the
.01 level. It has been possible to keep loadings in the targeted solutions
above the latter level. Loadings more negative than -.18
do occur in varimax rotations, however, and a few have actually
exceeded the generally accepted significance level (for interpretation
purposes) of -.30. In the 26 analyses pertinent to this report,
there were 14 such cases, distributed among 12 factor analyses.
Such loadings certainly do not make sense psychologically for tests
of ability.
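
The two thresholds quoted above follow from the standard error of .07 if one assumes a two-tailed test against the normal distribution; the paper does not state the test it used, so that assumption is ours. A brief Python sketch, with a hypothetical function name:

```python
from statistics import NormalDist


def critical_loading(standard_error, alpha):
    """Smallest absolute loading significant at level alpha (two-tailed, normal)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z * standard_error


if __name__ == "__main__":
    print(round(critical_loading(0.07, 0.05), 2))  # 0.14
    print(round(critical_loading(0.07, 0.01), 2))  # 0.18
```

On this reading, a loading of -.30 lies more than four standard errors from zero.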

We are next concerned with the question of invariance of factors
and of relations of factors to tests. Clearly parallel studies of this
question cannot be made in connection with the two kinds of rotations,
because, while the SI factors are usually identifiable in the targeted
solutions they are much less often identifiable in the varimax
solutions. Confining our attention again to the 12 selected tests that
have been used in five or more analyses with the targeted rotations,
we find the data of Table 3. For each test is given the number of
analyses in which it appeared, the number of times it had a significant
loading on its salient SI factor, and the frequencies of significant
loadings on other factors. Because the rules adopted for identification
of factors mentioned earlier did not accept singlets for
identifications, singlets are listed separately. If one accepts singlets
as accounting for their most likely abilities, it will be seen that
with very few exceptions (only 3 out of 128, in fact) these tests
came out with significant loadings on their expected factors. There
were some significant loadings on other factors, most often of secondary
size.


TABLE 3

Invariances of Relations of Twelve Selected Tests
to Factors in Target Rotations*

                           Number of   Frequency of loadings ≥ .30
Test name                  analyses    on various factors

Alternate Uses                10       DMC, 8; DMT, 3; DMR, 2; CMI, 2; CMT, 1
Associational Fluency         10       DMR, 9; AF-singlet, 3; CMU, 2; DMT, 1; DMC, 1; DMS, 1
Consequences—obvious          10       DMU, 10; CMI, 1
Consequences—remote           10       DMT, 9; DMC, 2; DMU, 1; CMI, 1
Ideational Fluency             5       DMU, 5; DMS, 1; DMI, 1
Numerical Operations          11       MSI, 8; NO-singlet, 3; DSR, 1; NSI, 1; ESU, 1
Perceptual Speed               6       EFU, 4; PS-singlet, 2
Plot Titles—clever            12       DMT, 11; PT-c-singlet, 1; DMS, 1
Plot Titles—nonclever         12       DMU, 12; DMT, 1; DMI, 1
Ship Destination              17       CMS, 13; SD-singlet, 4; CFI, 1; CMI, 1
Verbal Comprehension          20       CMU, 16; VC-singlet, 4; CMR, 1; DMR, 1
Word Fluency                   5       DSU, 4; WF-singlet, 1; DMS, 1

* Factors named with structure-of-intellect trigrams, except for singlet factors.

The invariance problem was investigated differently for varimax
rotations. From this study of the data, the results for 5 of the
12 selected tests are given in Table 4 for illustrative purposes. Taking
the Verbal Comprehension test, for example, a vocabulary test
that is a most faithful and univocal marker for ability CMU (cognition
of semantic units), we examined the 20 analyses in which that
test appeared, noting in each analysis on which varimax factor or
factors it was significantly loaded. We then attempted to give to
each factor a psychological interpretation and an appropriate name.
The results for Verbal Comprehension appear in the first section of
Table 4. For example, Verbal Comprehension appeared significantly
loaded on factor J in analysis number 3.² In view of the several
tests that appeared along with it, the factor called for the title of
“verbal reasoning.” In analysis number 6, this unique test appeared
on three different factors, A, J, and K, which were interpreted as
“word familiarity,” “verbal reasoning,” and “verbal manipulation,”
respectively. Factor A, “word familiarity,” would seem to be closest
in psychological meaning to SI ability CMU. On through the
list we selected those factors that come closest in appearance to
CMU. There were eight such factors out of 20, plus one singlet
factor marked by Verbal Comprehension.

² The numbering of analyses here is in accordance with the numbering
of the Reports issued by the Aptitudes Research Project.


TABLE 4 


Invariances of Relations of Five Frequently Used Tests to 
Factors in Varimax Rotations


VERBAL COMPREHENSION (Vocabulary) 


Analysis Factor Intuitive name of the factor Ratio identified 
3 J Verbal reasoning 
6 A Word familiarity 
J Verbal reasoning 
K Verbal manipulation 
8 A Verbal facility 
9 A Verbal reasoning 
R Verbal understanding 
S Word familiarity and facility 
12 A Understanding orders and sequences 
H Seeing meaningful reorganizations of 
information 
K Verbal comprehension—singlet 
14 F Verbal understanding 
16B C General verbal facility 
G Verbal evaluation 
K Verbal reasoning 
16C G General verbal facility 
L Verbal association 
17 A Verbal knowledge 
18 G General verbal facility 
21 C Verbal knowledge 
22 B Verbal reasoning 
I Verbal knowledge
23 E Word familiarity and facility 
I Facility with words 
29 C Verbal knowledge 
32 C Verbal knowledge 
34 J Verbal reasoning 
35 A General verbal ability 
37 I General verbal understanding, 
memory, and production 
38 H Verbal knowledge 
39 D Verbal reasoning 


8/20 
(plus 1 singlet) 


WORD FLUENCY 


17 A Verbal knowledge 
C Word fluency 
23 B Word closure 
M Word fluency—singlet 
26 D General verbal fluency 
H Word fluency 
35 A General verbal ability 
H Word fluency 
39 B Verbal flexibility 
F Word fluency 
4/5 
(plus 1 singlet) 
PERCEPTUAL SPEED 
3 B Perceptual speed 
6 B Perceptual speed 
8 J Perceptual speed 
29 A Word closure 
M Seeing spatial concepts 
33 F Perceptual speed
35 D [illegible]
G Numerical manipulation
V Perceptual speed—singlet
4/6
(plus 1 singlet)


PLOT TITLES—nonclever


[Analysis numbers and factor letters for this section are illegible in the source; the intuitive factor names are listed in their original order.]

Fluency of ideas
Commonness of responses in free situations
Understanding orders and sequences
Production of appropriate ideas
Divergent verbal thinking
Plot Titles—nonclever—singlet
Sensitivity to problems
Fluency of ideas
Verbal knowledge
Fluency of ideas
Speed of writing
Fluency of ideas
Fluency of ideas
Fluency of ideas
Fluency of ideas
Plot Titles—nonclever—singlet
Verbal and figural fluency
Fluency of ideas
7/12
(plus 2 singlets)


PLOT TITLES—clever


8 M Uncommonness of responses in free situations
12 G Verbal problem solving
13 F Producing clever responses
16C T Plot Titles—clever—singlet
17 D Speed of writing
F Verbal cleverness
L Fluency of verbal rearrangements
18 A General verbal divergent thinking
G General verbal facility
21 A Seeing verbal implications
26 K Plot Titles—clever—singlet
27 D Verbal flexibility
34 C Originality
35 A General verbal ability
S Plot Titles—clever—singlet
37 D Plot Titles—clever—singlet
1/12
(plus 4 singlets)


In similar manner the varimax rotations were examined for significant
relations to the tests Word Fluency, Perceptual Speed, Plot
Titles—nonclever, and Plot Titles—clever, in turn. The results appear
in later sections of Table 4. The number of times each varimax
factor could be identified as the dominant SI ability represented
by the test is given, with its ratio to the number of times that
factor could possibly be so identified. Those ratios were fairly good
for Word Fluency's relation to SI ability DSU and for Perceptual
Speed's relation to EFU. They were only fair and poor, respectively, for the
two Plot Titles test scores. These comparisons do not tell the
whole story, by any means, for close study of the tests that appeared
along with any one of these marker tests on their respective factors
would show much variation, even when the factor can be identified
as a probable SI ability.

In order to obtain a better idea of the heterogeneity of tests that
go together on a factor in varimax rotations, one needs to examine
such lists, as can be done in Table 5. There we have reproduced
the list of factors and their tests (with singlets omitted), for a
particular analysis involving 59 empirical variables and 25 factors.
The tests are listed for each factor in the order of their loadings
on that factor; after each test title is given its dominant SI
ability in terms of its trigram label. The first letter of the trigram
gives the ability's kind of operation (cognition, memory, etc.), the
second letter gives the kind of informational content (figural, symbolic,
etc.), and the third gives the kind of product of information
(unit, class, etc.).
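
The trigram convention lends itself to a lookup table. The Python sketch below is an illustration with hypothetical names; the letter tables follow the published structure-of-intellect model (Guilford, 1967), which this passage spells out only in part, so they should be read as an assumption drawn from that model.

```python
OPERATIONS = {
    "C": "cognition",
    "M": "memory",
    "D": "divergent production",
    "N": "convergent production",
    "E": "evaluation",
}
CONTENTS = {
    "F": "figural",
    "S": "symbolic",
    "M": "semantic",
    "B": "behavioral",
}
PRODUCTS = {
    "U": "units",
    "C": "classes",
    "R": "relations",
    "S": "systems",
    "T": "transformations",
    "I": "implications",
}


def decode_trigram(trigram):
    """Split a three-letter SI label into (operation, content, product)."""
    if len(trigram) != 3:
        raise ValueError("an SI trigram has exactly three letters")
    operation, content, product = trigram.upper()
    # A KeyError here means the letter is not part of the model.
    return OPERATIONS[operation], CONTENTS[content], PRODUCTS[product]


def describe(trigram):
    """Render a trigram as a phrase such as 'cognition of semantic units'."""
    operation, content, product = decode_trigram(trigram)
    return f"{operation} of {content} {product}"


if __name__ == "__main__":
    print(describe("CMU"))  # cognition of semantic units
    print(describe("EFU"))  # evaluation of figural units
```

The same letter can mean different things in different positions (M is memory as an operation but semantic as a content), which is why the lookup is done position by position.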


TABLE 5
Where SI Factor Tests Went in a Varimax Rotation*


A.
  .67          Verbal Comprehension              CMU
  .63          Simile Insertions                 DMR
  .59          Alternate Uses                    DMC
  .58          Associational Fluency             DMR
  .57          Multiple Analogies                DMR
  .48          Expressional Fluency              DMS
  .47          Simile Interpretations            DMS
  [illegible]  Seeing Problems                   CMI
  .42          Consequences—remote               DMT
  .38          Planning Elaboration II           DMI
  [illegible]  Plot Titles—clever                DMT
  .37          Possible Jobs                     DMI
  .35          Word Fluency                      DSU
  .33          Ideational Fluency                DMU
  .32          Seeing Trends II                  CSR
  .32          Suffixes                          DSU
  .31          Alternate Signs                   DMT

B.
  [illegible]  Make a Figure Test—flu.           DFU
  .65          Monograms                         DFS
  .54          Designs                           DFS
  .58          Make a Mark                       DFU
  .51          Dot Systems Test                  DFU
  .40          Sketches                          DFU
  .33          Making Objects                    DFS
  .33          Marking Speed                     FS

C.
  .68          Match Problems III                DFT
  .67          Match Problems IV                 DFT
  .66          Match Problems V                  DFT
  .42          Planning Air Maneuvers            DFT
  .36          Dot Systems Test                  DFU
  .32          Match Problems II                 DFT

D.
  .76          Sex membership                    Sex
  .40          Perceptual Speed                  EFU
  .36          Limited Words                     DSI
  .35          Name Grouping                     DSC
  .33          Ideational Fluency                DMU
  .31          Word Relations                    CSR
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TABLE 5 (Continued) 
E. 
.07 Make A Code Test DSS 
.59 Match Problems II DFT 
.43 Multiple Grouping DMC 
F. 
.58 Varied Symbols DSC 
.32 Expressional Fluency DMS 
G. 
-76 Plot Titles—nonclever DMU 
47 Consequences—obyious DMU 
.46 Monograms DFS 
+35 Sketches DFU 
81 Planning Elaboration II DMI 
H. 
73 Number Rules DSR 
72 Alternate Additions DSR 
62 Number Combinations DSR 
62 Number Grouping DSC 
61 Word Relations CSR 
57 Symbol Elaboration DSI 
55 Seeing Trends II CSR 
54 Alternate Letter Groups DFC 
+54 Numerical Operations MSI 
AT Multiple Analogies DMR 
.45 Limited Words DSI 
.89 Figure Classification CFC 
+85 Match Problems III DFT 
+83 Figural Similarities DFC 
+33 Seeing Problems CMI 
.30 Perceptual Speed EFU 
-30 Picture Classification CFC 
I 
62 Word Fluency DSU 
.59 Suffixes DSU 
+35 Expressional Fluency DMS 
.82 Ideational Fluency DMU 
.82 Numerical Operations MSI 
J. 
41 Multiple Grouping DMC 
-30 Alternate Letter Groups DFC 
K. 
45, Picture Classification CFC 
44 Figure Classification CFC 
L. 
.62 Decorations DFI 
-38 Alternate Uses DMC
M. 
.52 Production of Figural Effects DFI 
—.98 Limited Words DSI 
N. 
.42 Sketches DFU 
.82 Possible Jobs DMI 
O.
65 Utility Test—shifts DMC 
38 Consequences—obvious DMU 
.96 Alternate Signs DMT 
.93 Make a Mark DFU 
P.
.60 Figure Production DFI 
AT Planning Air Maneuvers DFT 
40 Making Objects DFS 
36 Alternate Signs DMT 


* Nine singlet factors omitted. 


In examining the list of tests for a factor, there is special inter- 
est in noting how much consistency there is for operation, content, 
and product. For example, for the first factor mentioned in Table 
5, the leading test is for CMU, a cognition ability, with only two 
other tests in the list pertaining to the same kind of operation. All 
the others, save two, pertain to divergent-production abilities. The 
factor might be recognized as some kind of general divergent- 
production ability, were it not for the presence of three cognition 
tests, one of them leading the whole list, and the fact that divergent- 
production abilities are represented in almost every factor list. The 
whole battery emphasized divergent-production tests. As to con- 
sistency of content, most of the tests in the first list pertain to 
semantic abilities, the symbol for which is “M.” Only two tests 
have different content (S, for symbolic). The first factor might be 
interpreted as a general semantic ability, in contrast to the next 
two factors, which could lay claim to being figural abilities, which 
they are, but they are special, not general, as we shall see. All 
six kinds of products are represented in the list for the first factor, 
so there is no chance of identifying it in terms of any one of them. 

To one who is well acquainted with structure-of-intellect affairs,
it is obvious that the second varimax factor here is a confounding
of two SI abilities, DFU and DFS (divergent production of figural
units and divergent production of figural systems). It happens
that those two factors were unusually difficult to separate in
orthogonal rotations of other kinds, because, with one or two
exceptions, in both cases tests of the one factor had secondary
loadings on the other.

In making a targeted rotation, the tests with the label DFU in
this second list were given target loadings in that direction and
tests with the label DFS were given target loadings in an orthogonal
direction for DFS. After some successive rotations, with revised
target values, the seven tests of SI abilities separated as shown in
Table 6. It will be seen that the four tests labeled for DFU could
be represented by two factors, with the dominant loading on DFU
and a secondary loading on DFS in each case. Only one of the
loadings on DFS was significant. The three tests labeled for DFS came
out with dominant loadings on DFS, but with two also having
significant loadings on DFU. The lack of full univocality for tests of
these two factors can be attributed to the inability of the test
writers to control the examinees' processing of information in terms
of both units and systems in these tests, in spite of the intention
that only one kind of product be involved in each test. A separation
of the two factors can be considered effected, and improved tests

TABLE 6
Loadings of Figural-Units and Figural-Systems on
Two Divergent-Production Factors

                                   Loadings
Test                             DFU     DFS
Make a Figure Test—Fluency       .61     .41
Sketches                         .52     .21
Dot Systems                      .52     .18
Make a Mark                      .46     .24
Designs                          .37     .52
Monograms                        .36     .54
Making Objects                   .16     .50

might preclude the involvement of more than one kind of product
in each test. 

The third factor in the list in Table 5 is an example of as clear
a case as we should expect of an SI ability being represented by a
varimax factor. It is clearly the DFT ability, divergent production
of figural transformation (also known as “adaptive flexibility”). 
The presence of the four somewhat similar match-problems tests 
may have helped, but, except possibly for match-problems tests 
II and III, they cannot be regarded as strictly alternate forms of 
the same test. 

The second factor in Table 5 that has an unusually large list of 
tests loaded on it features tests of ability DSR (divergent produc- 
tion of symbolic relations) among the most heavily loaded. But 
the additional conglomerate list of tests differing in terms of opera- 
tion, content, and product, spoils the picture of a factor representing 
SI ability DSR. The factor immediately following in the second 
column starts out with two good marker tests for ability DSU (word 
fluency), but there follow three tests that rarely go with it in 
targeted rotations, to spoil the picture of a DSU factor. 


Conclusions 


These examples contribute a great deal toward demonstrating the
kinds of weaknesses of the varimax method of rotation. To be sure,
there are suggestions of theoretically expected factors here and 
there, but there is rarely a clearcut representation of such an ability. 
The major trouble is that in unexplored territory one would not 
know which factors represent genuine psychological variables and 
which ones do not, nor would one be confident that all the tests 
belong on a particular factor, or that all the tests that should be on 
the factor are found there. The varimax factors having relatively 
long lists of tests on them are almost certain to be confoundings 
and confusions, for the tests associated with them are heterogeneous 
in character. 

Varimax rotations may have yielded more psychological sense 
in other areas of psychological investigation, but we know of no 
instance in which a thoroughgoing test has been made of this pos- 
sibility such as we have reported in the area of intellectual abilities. 
We have not made similar tests of the validity of alternative
analytical rotational methods with the same experimental data. A
certain amount of such investigation has been done by others, with
apparently no more promising results.*

In psychological factor analyses, then, we seem to be left on
the two horns of a dilemma. Either we limit ourselves to the kind
of hypothesis testing we have done, employing rotations directed
by knowledge of theory, or we risk outcomes that are bewildering,
misleading, and lacking in invariance by employing rotational
methods that are mathematically elegant and completely objective.
As psychologists, we should much prefer the former. Experiences
such as we have reported lead to suspicions that completely
satisfactory, purely objective methods cannot be developed. It is
often said that methods such as that using theoretically targeted
goals belong in a category facetiously referred to as “procrustean.”
It can well be argued that a method that demands the fitting of
data to some arbitrary mathematical model that does not match
psychological reality better deserves that title.


A PROCEDURE FOR SAMPLE-FREE ITEM ANALYSIS


BENJAMIN WRIGHT AND NARGIS PANCHAPAKESAN
University of Chicago


Our purpose is to describe in detail a convenient procedure for 
performing a new kind of item analysis. This new item analysis is 
different in a vital way from that described in textbooks like 
Gulliksen’s Theory of Mental Tests and used in computing pro- 
grams like TSSA2. The difference is that (a) test calibrations are 
independent of the sample of persons used to estimate item param- 
eters, and (b) person measurements, the transformation of test 
scores into estimates of person ability, are independent of the 
selection of items used to obtain test scores. 

The procedure for sample-free item analysis is based on a very 
simple model (Rasch, 1960, 1966a, 1966b) for what happens when 
any person encounters any item. The model says that the outcome 
of such an encounter is governed by the product of the ability of 
the person and the easiness of the item and nothing more! The 
more able the person, the better his chances for success with any 
item. The easier the item, the more likely any person is to
solve it. 

This means that variation in additional item characteristics like 
guessing and discrimination must be dealt with during the con- 
struction and selection of items for the final sample-free pool. The 
aim is to create a pool of items with similar discrimination and 
minimal guessing. Since the method for measuring person ability 
is quite robust with respect to departures from the assumption 
that the only characteristic on which items differ is easiness, this 
aim is not difficult to satisfy. The procedure to be described
includes a statistical test for item fit which facilitates the
identification of “bad” items which do not conform to the assumptions of
the model.

The use of this simple model for mental measurement makes 
it possible to take into account whatever abilities persons in the 
calibration sample happen to have and to free the calibration of 
test items from the particulars of these abilities. As a result no
assumptions need be made about the distribution of ability in 
the target population or in the calibration sample. 

In its mathematical form this model for sample-free item
analysis says that the observed response $a_{ni}$ of person $n$ to item $i$
is governed by a binomial probability function of person ability
$Z_n$ and item easiness $E_i$. The probability of a right response is:

$$\Pr(a_{ni} = 1) = \frac{Z_n E_i}{1 + Z_n E_i} \tag{1}$$

and the probability of a wrong response is:

$$\Pr(a_{ni} = 0) = 1 - \Pr(a_{ni} = 1) = \frac{1}{1 + Z_n E_i}. \tag{1'}$$

Taking advantage of the convention that $a_{ni} = 1$ means right and
$a_{ni} = 0$ means wrong, we can combine (1) and (1') to give:

$$\Pr(a_{ni}) = \frac{(Z_n E_i)^{a_{ni}}}{1 + Z_n E_i}. \tag{2}$$

It is also convenient to express (2) in an alternative form in which
we write the model parameters $Z_n$ and $E_i$ in a log form as follows:

$$\Pr(a_{ni}) = \frac{\exp\left(a_{ni}(b_n + d_i)\right)}{1 + \exp(b_n + d_i)} \tag{3}$$

where $b_n = \log Z_n$ and $d_i = \log E_i$.
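
As a small illustration of the model, the following Python sketch evaluates the response probability in both the multiplicative form (1) and the log form (3) and checks that they agree. The function names `p_right` and `p_response` are hypothetical and chosen only for this example; they do not come from the article.

```python
import math


def p_right(Z: float, E: float) -> float:
    """Equation (1): probability of a right response from ability Z and easiness E."""
    return Z * E / (1.0 + Z * E)


def p_response(a: int, b: float, d: float) -> float:
    """Equation (3): probability of response a (1 = right, 0 = wrong),
    with b = log Z (log ability) and d = log E (log easiness)."""
    return math.exp(a * (b + d)) / (1.0 + math.exp(b + d))


# The two forms describe the same probability.
Z, E = 2.0, 0.5
b, d = math.log(Z), math.log(E)
same = abs(p_right(Z, E) - p_response(1, b, d)) < 1e-12
```

When ability and easiness balance exactly (here Z E = 1, so b + d = 0), the chance of a right answer is one half, and the probabilities of the two possible responses always sum to one.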

An important consequence of this model is that the number of 
correct responses to a given set of items is a sufficient statistic for 
estimating person ability. This score is the only information needed
from the data to make the ability estimate. Therefore, we need 
only estimate an ability for each possible score. Any person who 
gets a certain score will be estimated to have the ability associated 
with that score. All persons who get the same score will be estimated 
to have the same ability. 

This encourages us to rewrite (3) in terms of score groups:

$$\Pr(a_{ni}) = \frac{\exp\left(a_{ni}(b_j + d_i)\right)}{1 + \exp(b_j + d_i)} \tag{4}$$

where $j$ is the score obtained by person $n$, and all persons with a
score $j$ are estimated to have the same probability governing their
responses to item $i$.


There are two stages in the measurement of person ability. The
first stage, item calibration, consists in estimating the item
parameters $d_i$ and their standard errors. This is done by analyzing the
responses of a sample of N persons to a set of k items. It is during
this stage that items are discarded which do not satisfy the cri- 
teria considered important from the point of view of the model. 
In typical item analysis desirable characteristics of a test are high 
reliability and validity, therefore items with low indices of reli- 
ability or validity are dropped. For this sample-free model the 
essential criterion is the compatibility of the items with the model. 

The failure of an item to fit the model can be traced to two 
main sources. One is that the model is too simple. It takes account 
of only one item characteristic—item easiness. Other item param- 
eters like item discrimination and guessing are neglected. As a 
matter of fact, parameters for discrimination and guessing can 
easily be included in a more general model. Unfortunately their 
inclusion makes the application of the model to actual measure- 
ment very complicated, if not impossible. The sample-free model 
assumes that all items have the same discrimination, and that the 
effect of guessing is negligible. Our experience with the analysis of 
real data suggests that the model is quite robust with respect to 
departures from these assumptions. 

The other source of lack of fit of an item lies in the content of 
the item. The model assumes that all the items used are measuring
the same trait. Items in a “test” may not fit together if the “test” 
is composed of items which measure different abilities. This in- 
cludes the situation in which the item is so badly constructed or so 
mis-scored that what it measures is irrelevant to the rest of the 
“test.” 

If a given set of items fits the model, this is evidence that they
refer to a unidimensional ability, that they form a conformable
set. Fit to the model also implies that item discriminations are 
uniform and substantial, that there are no errors in item scoring 
and that guessing has had a negligible effect. Thus the criterion 
of fit to the model enables us to identify and delete “bad” items. 
Item calibration is concluded by reanalyzing the retained items 
to obtain the final estimates of their easinesses. 

In the second stage, person measurement, some or all of the 
calibrated items are used to obtain a test score. An estimate of 
person ability and the standard error of this estimate are made
from the score and from the easinesses of the items used. The
flexibility of being able to use some or all of a set of items in a “test”
is an important advantage of this method of item analysis. Mean- 
ingful comparisons of ability can be made even when the particular 
items used to make the different measurements are not the same. 
The number of items selected for any measurement can be deter- 
mined by the testing time available and the accuracy required. 

In this procedure the “reliability” of a test, a concept which 
depends upon the ability distribution of the sample, is replaced 
by the precision of measurement. The standard error of the ability 
estimate is a measure of the precision attained. This standard 
error depends primarily upon the number of items used. The range
of item easiness with respect to the ability level being measured
also affects the standard error of the ability estimate, but in
practice this effect is minor compared to the effect of test length. It is
possible to reach any desired level of precision by varying the
number of items used in the measurement, provided that the
range of item easiness is reasonably appropriate to the abilities
being measured.

We shall describe two methods for the estimation of item and
person parameters and their standard errors. Both methods are
such that ability estimates are obtained at the same time as item
estimates. The equations used for person measurement, given
calibrated items, are similar to those used during item calibration.
The difference is that during person measurement the items
are assumed calibrated, and so item easinesses are no longer
estimated but kept fixed. However, one is not usually interested in
ability measurement at the stage of item calibration. Usually a
pool of items is calibrated first and then later used selectively
for measurement.

The first method of estimation uses unweighted least squares 
and will be referred to as LOG. The second method uses maximum 
likelihood and will be referred to as MAX. In general MAX is pref- 
erable to LOG. MAX gives better estimates of the model param- 
eters, and the standard errors of estimate are better approximated. 
However, when the calibration sample is large, and the ability range 
of the sample is wider than the easiness range of the item param- 
eters, then the item estimates obtained by LOG are equivalent to 
the estimates obtained by MAX.

In general we recommend that MAX be used whenever possible.
Our reason for describing LOG is that it is conceptually and com- 
putationally simple. If a small computer is unavailable, LOG can 
be used to obtain rough parameter estimates and their standard 
errors. 

Despite the simplicity of LOG we would like to emphasize that 
MAX is not much more complicated. The characteristic which 
makes MAX more difficult to use is its system of implicit equations 
which must be solved by an iterative procedure. This iterative pro- 
cedure is easy to perform on a small computer but tedious on a 
desk calculator. 


Methods 


A. LOG Method: 


1. Description. 

The log method of estimation is based on using the observed
proportion of successes $a_{ji}/r_j$ within a particular score group $j$ as
an estimate of the probability $p_{ji}$ of obtaining a correct response,
for any person in score group $j$, to an item of easiness $E_i = \exp d_i$:

$$p_{ji} \approx a_{ji}/r_j$$

$$p_{ji} = \frac{\exp(b_j + d_i)}{1 + \exp(b_j + d_i)} \tag{5}$$

where $b_j$ is the ability associated with score group $j$,
$r_j$ is the number of persons in score group $j$, and
$a_{ji}$ is the number of persons in score group $j$ who get item $i$
correct,

and

$$(r_j - a_{ji})/r_j \approx \frac{1}{1 + \exp(b_j + d_i)}$$

so

$$a_{ji}/(r_j - a_{ji}) \approx \exp(b_j + d_i)$$

and

$$t_{ji} = \log\left(a_{ji}/(r_j - a_{ji})\right) \approx b_j + d_i \tag{6}$$

so

$$t_{ji} = b_j^* + d_i^* \tag{7}$$

where $d_i^*$ = estimate of $d_i$
and $b_j^*$ = estimate of $b_j$.

This leads to the estimation equations
$$d_i^* - d_\cdot^* = t_{\cdot i} - t_{\cdot\cdot} \tag{8}$$

where

$$d_\cdot^* = \frac{1}{k}\sum_{i=1}^{k} d_i^*, \qquad
t_{\cdot i} = \frac{1}{k-1}\sum_{j=1}^{k-1} t_{ji}, \qquad
t_{\cdot\cdot} = \frac{1}{k}\sum_{i=1}^{k} t_{\cdot i}.$$

Since there is an indeterminacy in the scale of easiness we can
determine the scale so that $d_\cdot^* = 0$ to give:

$$\log E_i^* = d_i^* = t_{\cdot i} - t_{\cdot\cdot} \tag{9}$$

as the basic equation for estimating item easiness.
We also obtain an estimation equation for ability:

$$\log Z_j^* = b_j^* = t_{j\cdot} - t_{\cdot\cdot} \tag{10}$$

where $t_{j\cdot} = \frac{1}{k}\sum_{i=1}^{k} t_{ji}$.
Equations (9) and (10) are the basic estimation equations for the
log method.

To calculate standard errors of the estimates $b_j^*$ and $d_i^*$ we need
expressions for the variance of $t_{ji}$. This is obtained from the
variance of $a_{ji}$. The number of successes $a_{ji}$ in the score group $j$ has a
binomial distribution, and hence the variance of $a_{ji}$ will be given by:

$$V(a_{ji}) = r_j\, p_{ji}(1 - p_{ji})$$

where $p_{ji}$ is the probability of obtaining a success. The variance of
$t_{ji}$ can be approximated from:

$$V(t_{ji}) \approx (\partial t_{ji}/\partial a_{ji})^2\, V(a_{ji}) = \frac{1}{r_j\, p_{ji}(1 - p_{ji})}$$

or

$$V^*(t_{ji}) = \frac{1}{r_j\, p_{ji}^*(1 - p_{ji}^*)} \tag{11}$$

where $p_{ji}^* = \exp(b_j^* + d_i^*)/(1 + \exp(b_j^* + d_i^*))$
and $\partial t_{ji}/\partial a_{ji}$ is the partial derivative of $t_{ji}$ with respect
to $a_{ji}$, and equals $1/(r_j\, p_{ji}^*(1 - p_{ji}^*))$.

From (9) we get for the variance of $d_i^*$:

$$V(d_i^*) = V(t_{\cdot i} - t_{\cdot\cdot}).$$

We know that the $t_{ji}$'s are independent with respect to variation
in $j$, that is, for given $i$, $t_{ji}$ and $t_{j'i}$ are independent, because they come
from different groups of persons. However, there is a relationship
between $t_{ji}$ and $t_{ji'}$ for any score group $j$ because of the constraint
$\sum_{i=1}^{k} a_{ji} = j\,r_j$. In fact, the actual covariances between $t_{ji}$ and $t_{ji'}$ are
very small. For simplicity we will assume that the $t_{ji}$'s are
independent of each other in both directions. Then for the variance of $d_i^*$
we get:

$$V(d_i^*) \approx (1 - 1/k)\,V(t_{\cdot i}) < V(t_{\cdot i})$$

so

$$V^*(d_i^*) \approx V(t_{\cdot i}) = \left(\frac{1}{k-1}\right)^2 \sum_{j=1}^{k-1} V^*(t_{ji}). \tag{12}$$

This approximation is conservative. The exact variances of the
estimates are smaller than those given by (12). The standard error
of the ability estimate is approximated from:

$$V^*(b_j^*) = \left(\frac{1}{k}\right)^2 \sum_{i=1}^{k} V^*(t_{ji}). \tag{13}$$
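
The variance approximations (11) to (13) can be sketched in Python as follows. This is only an illustration under stated assumptions: every score group from 1 to k − 1 is taken to be non-empty, the cell variances are treated as independent as in the text, and the averaging constants are read as the squared reciprocals of the number of terms averaged. The function name `log_method_variances` is hypothetical.

```python
import math


def log_method_variances(b, d, R):
    """Approximate variances of LOG estimates.

    b : ability estimates b_j*, one per score group (assumed all non-empty)
    d : item easiness estimates d_i*
    R : score group sizes r_j, aligned with b
    Returns (var_d, var_b), one variance per item and per score group.
    """
    k = len(d)

    def v_t(bj, di, rj):
        # Equation (11): variance of one logit cell t_ji.
        p = math.exp(bj + di) / (1.0 + math.exp(bj + di))
        return 1.0 / (rj * p * (1.0 - p))

    # Equation (12): item variance from the cells in that item's column.
    var_d = [sum(v_t(bj, di, rj) for bj, rj in zip(b, R)) / (k - 1) ** 2
             for di in d]
    # Equation (13): ability variance from the cells in that group's row.
    var_b = [sum(v_t(bj, di, rj) for di in d) / k ** 2
             for bj, rj in zip(b, R)]
    return var_d, var_b


# Three items, two score groups of four persons each, all parameters zero.
var_d, var_b = log_method_variances([0.0, 0.0], [0.0, 0.0, 0.0], [4, 4])
std_err_d = [math.sqrt(v) for v in var_d]
```

Standard errors are the square roots of these variances. Larger score groups shrink every cell variance, which is why calibration precision improves with sample size.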


Procedure 


A. Data Handling 


The observations consist of the responses of N individuals to each 
of k items which compose the test. The response to an item is 
coded 1 or 0, 1 if the response is correct and 0 otherwise. (The
procedure is restricted to dichotomous items, i.e. to items that can 
be coded right or wrong.) 

A k-dimensional response vector I of 1’s and 0’s can represent 
the response of an individual to the test. Hence, the data could be 
conceived of as an N x k matrix containing the responses of all 
the N persons to the k items. However, for estimation that matrix 
contains superfluous information because the ability estimate of 
an individual is entirely dependent on his score—the exact pattern 
of responses is immaterial. We do not need to know the response 
of an individual to a particular item, but only his total score to 
classify him according to estimated ability. 

The distribution of estimated ability for the whole sample can be 
summarized in a score vector R of dimension k − 1. The element $r_j$
of the vector R is set equal to the number of persons with a score 
of j. 

Scores of 0 and k are excluded because they do not contribute to 
the item calibration. They provide no differential information about 
the items. For these people all the items appear either equally hard 
or equally easy. In fact we cannot obtain point estimates of
ability for such people. Items which everyone gets right or everyone
gets wrong are also excluded. At the calibration stage we cannot
obtain point estimates for them from the sample, and at the
measurement stage, at least among the calibrating sample, they do not
provide differential information about the ability of the individuals
being measured.

Thus the original N × k data matrix can be collapsed into a
(k − 1) × k matrix A, such that an element $a_{ji}$ represents the
number of persons with a score of $j$ who get item $i$ correct. This
A matrix contains all the information bearing on test calibration.

The first step in the procedure then consists in computing A and
R. The total number of persons N' (excluding those that get zero
and maximum scores) can be counted at the same time. The
most convenient way of setting up the matrix A and vector R is
to read in one case (vector I) at a time. The score $j$ is calculated
by summing over all the responses:

$$j = \sum_{i=1}^{k} a_{ni}. \tag{14}$$

If $j$ = 0 or k the case is disregarded and the next case is read in.
When $j$ is in the permissible range the appropriate accumulation
is made to R and A. This is demonstrated below in terms of a
FORTRAN program segment which can be used as a subroutine
acting on each case:


      J = 0
      DO 1 L = 1, K
    1 J = J + I(L)
      IF (J) 7, 7, 3
    3 IF (J - K) 4, 7, 7
    4 DO 6 L = 1, K
    6 IA(J,L) = IA(J,L) + I(L)
      R(J) = R(J) + 1.
      RN = RN + 1.
    7 CONTINUE

where

I  = Response vector I
IA = Matrix A in fixed point
K  = Number of items k in test
RN = N', number of persons with scores not 0 or K
R  = Vector R of score group sizes.

It is assumed that IA, R and RN are zeroed before any cases are
accumulated into them.
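
For readers without a FORTRAN II compiler, the same accumulation can be sketched in Python. This is an illustrative translation, not the authors' program: the function name `accumulate` and the small data set are hypothetical, and row `j - 1` of `A` is used for score group j because Python lists are indexed from zero.

```python
def accumulate(cases, k):
    """Collapse 0/1 response vectors into the matrix A and vector R.

    cases : iterable of response vectors, each of length k
    Returns (A, R, n_used) where A[j-1][i] counts persons with score j who
    got item i right, R[j-1] counts persons with score j, and n_used is N'.
    """
    A = [[0] * k for _ in range(k - 1)]
    R = [0] * (k - 1)
    n_used = 0
    for resp in cases:
        j = sum(resp)                 # equation (14): the person's score
        if j == 0 or j == k:          # zero and perfect scores are disregarded
            continue
        for i, a in enumerate(resp):
            A[j - 1][i] += a
        R[j - 1] += 1
        n_used += 1
    return A, R, n_used


# Six hypothetical persons answering three items.
cases = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [1, 1, 1], [0, 0, 0], [1, 0, 1]]
A, R, n_used = accumulate(cases, 3)
```

A useful check on the result is the constraint mentioned earlier: the entries in row j of A must sum to j times the size of score group j.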

If any $r_j$ is zero we disregard the score group $j$. An empty score
group does not contribute any information to the item estimation or
to the test for the item fit. Also in the case of the log method we
cannot obtain ability estimates directly for empty score groups.
Therefore, the useful score groups are those which
have one or more persons in them. We compute $m$, the number of
such useful score groups, by scanning the vector R:

$$m = \sum_{j=1}^{k-1} z_j, \qquad \text{where } z_j = 1 \text{ if } r_j > 0 \text{ and } z_j = 0 \text{ if } r_j = 0. \tag{15}$$

The information from the data contained in R, A, N' and $m$ is
enough to enable us to estimate the model parameters and their
standard errors.


B. Estimation


To get estimates by the log method we transform the data in A
to a matrix T where the element $t_{ji}$ is given by

$$t_{ji} = \log\left(a_{ji}/(r_j - a_{ji})\right). \tag{16}$$

We run into problems when $a_{ji} = 0$ or when $a_{ji} = r_j$, because at
these values $t_{ji}$ is infinite. To avoid this difficulty we modify T such
that:

$$t_{ji} = \log\left((a_{ji} + w)/(r_j - a_{ji} + w)\right) \tag{17}$$

where $w = r_j/N'$.

The advantage of this adjustment is that now when $a_{ji} = 0$ or
$a_{ji} = r_j$ then $t_{ji} = \mp\log(1 + N')$. These limits for extreme values of
$t_{ji}$ seem reasonable, because for N' persons $\log(1 + N')$ is an outside
limit on the magnitude that any cell in T can take. Thus the matrix T
is set up using the expression (17) for each element of the matrix.
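
The effect of the adjustment in (17) is easy to verify numerically. The Python sketch below, with the hypothetical function name `adjusted_logit`, computes one cell of T and confirms that empty and full cells land exactly on the limits −log(1 + N') and +log(1 + N').

```python
import math


def adjusted_logit(a: int, r: int, n_used: int) -> float:
    """Equation (17): logit of a successes out of r, with w = r / N'."""
    w = r / n_used
    return math.log((a + w) / (r - a + w))


# A score group of 5 persons in a calibration sample of N' = 40.
empty_cell = adjusted_logit(0, 5, 40)   # nobody in the group got the item right
full_cell = adjusted_logit(5, 5, 40)    # everybody in the group got it right
limit = math.log(1 + 40)
```

Because w is proportional to the group size, the limits are the same for every score group, whatever its size.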

The estimates $d_i^*$ are obtained from T using (9):

$$d_i^* = t_{\cdot i} - t_{\cdot\cdot} = \left(\frac{1}{m}\sum_{j} t_{ji}\right) - \left(\frac{1}{mk}\sum_{j}\sum_{i} t_{ji}\right). \tag{18}$$

In principle this is as far as we need proceed to obtain item
estimates by the log method, but the $d_i^*$'s obtained above contain the
extreme values for the empty and full cells in A, i.e. when $a_{ji} = 0$
or $a_{ji} = r_j$. We can improve the estimates by substituting values
for the unknown $t_{ji}$'s according to the model. To do this we also
need the ability estimates, which are obtained from T by (10):

$$b_j^* = t_{j\cdot} - t_{\cdot\cdot} = \left(\frac{1}{k}\sum_{i=1}^{k} t_{ji}\right) - t_{\cdot\cdot}. \tag{19}$$

From the model the estimated value we get for the cell $t_{ji}$ is:

$$t_{ji} = d_i^* + b_j^* + t_{\cdot\cdot} \tag{20}$$

therefore for the extreme cells we substitute this value in place of
$\mp\log(1 + N')$.

With these new values for the unknown cells in T we again
compute $d_i^*$ and $b_j^*$ according to (18) and (19). The results will
differ from the previous values depending upon the number of
empty and full cells in the matrix A.

The program steps in FORTRAN required for obtaining the
estimates $d_i^*$, $b_j^*$ and the matrix T are shown below.


      FK = K
      FL = LOGF(RN)
      NGK = K - 1
      SUM = 0
      FM = M
      DO 1 J = 1, NGK
    1 B(J) = 0
      DO 2 L = 1, K
      D(L) = 0
      DO 3 J = 1, NGK
      IF (R(J)) 4, 4, 5
    4 T1 = 0
      GO TO 3
    5 T1 = IA(J,L)
      T2 = R(J) - T1
      DEL = R(J)/RN
      T1 = LOGF((T1 + DEL)/(T2 + DEL))
      B(J) = B(J) + T1
      D(L) = D(L) + T1
    3 T(J,L) = T1
      D(L) = D(L)/FM
    2 SUM = SUM + D(L)
      SUM = SUM/FK
      DO 6 J = 1, NGK
      IF (R(J)) 7, 7, 8
    7 B(J) = 0
      GO TO 6
    8 B(J) = B(J)/FK - SUM
    6 CONTINUE
      DO 9 L = 1, K
    9 D(L) = D(L) - SUM
      TUM = 0
      DO 10 J = 1, NGK
   10 Y(J) = 0
      DO 11 L = 1, K
      X(L) = 0
      DO 12 J = 1, NGK
      IF (R(J)) 12, 12, 13
   13 T1 = T(J,L)
      IF (ABSF(T(J,L)) - FL) 15, 15, 14
   14 T1 = B(J) + D(L) + SUM
   15 X(L) = X(L) + T1
      Y(J) = Y(J) + T1
   12 CONTINUE
      X(L) = X(L)/FM
   11 TUM = TUM + X(L)
      TUM = TUM/FK
      DO 16 J = 1, NGK
      IF (R(J)) 17, 17, 18
   17 B(J) = 0
      GO TO 16
   18 B(J) = Y(J)/FK - TUM
   16 CONTINUE
      DO 19 L = 1, K
   19 D(L) = X(L) - TUM

B is the vector of ability estimates
D is the vector of item estimates.
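
The two-pass LOG estimation can also be sketched in Python. The version below is a hedged reading of the procedure, not a line-for-line port: the function name `log_method` and the tiny data set are hypothetical, abilities are returned in a dictionary keyed by score, and a cell is treated as extreme when its magnitude exceeds log N', which mirrors the comparison against FL in the listing.

```python
import math


def log_method(A, R):
    """Two-pass LOG estimates from the collapsed data.

    A[j-1][i] = persons with score j who got item i right
    R[j-1]    = number of persons with score j
    Returns (d, b): item easiness estimates and abilities keyed by score j.
    """
    k = len(A[0])
    n_used = sum(R)                                      # N'
    groups = [j for j in range(1, k) if R[j - 1] > 0]    # non-empty score groups
    m = len(groups)
    limit = math.log(n_used)

    # First pass: adjusted logits of equation (17).
    T = {}
    for j in groups:
        r = R[j - 1]
        w = r / n_used
        for i in range(k):
            a = A[j - 1][i]
            T[j, i] = math.log((a + w) / (r - a + w))

    def solve(cells):
        t_item = [sum(cells[j, i] for j in groups) / m for i in range(k)]
        t_all = sum(t_item) / k
        d = [t - t_all for t in t_item]                           # equation (18)
        b = {j: sum(cells[j, i] for i in range(k)) / k - t_all    # equation (19)
             for j in groups}
        return d, b, t_all

    d, b, t_all = solve(T)

    # Second pass: replace extreme cells by the model value of equation (20).
    T2 = {(j, i): (d[i] + b[j] + t_all) if abs(t) > limit else t
          for (j, i), t in T.items()}
    d, b, _ = solve(T2)
    return d, b


# Hypothetical collapsed data: 3 items, 2 persons in each of score groups 1 and 2.
A = [[1, 1, 0], [2, 1, 1]]
R = [2, 2]
d, b = log_method(A, R)
```

In this example the first item is answered correctly most often and the third least often, so the estimated easinesses come out in that order and sum to zero, as the scale convention requires.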
Methods


B. MAX Method:

1. Description.


Maximum likelihood is a widely used method for estimating
model parameters. The assumption involved in obtaining parameter
estimates is that the observed data are the most likely occurrence.
Parameters are estimated so that they maximize the probability
(likelihood) of obtaining the sample of observations.

The equations obtained when the condition of a maximum
likelihood is satisfied for the sample-free model (3) in the introduction
are:

$$a_{\cdot i} = \sum_{j=1}^{k-1} \frac{r_j \exp(b_j^* + d_i^*)}{1 + \exp(b_j^* + d_i^*)}, \qquad i = 1, 2, \ldots, k \tag{21}$$

$$j = \sum_{i=1}^{k} \frac{\exp(b_j^* + d_i^*)}{1 + \exp(b_j^* + d_i^*)}, \qquad j = 1, 2, \ldots, k - 1 \tag{22}$$

where $a_{\cdot i}$ = number of persons who get item $i$ correct (item score),
$j$ = the score (an ability estimate is obtained for each score),
$r_j$ = number of persons in score group $j$,

and the log likelihood is

$$\sum_{j=1}^{k-1}\sum_{i=1}^{k} a_{ji}(b_j + d_i) - \sum_{j=1}^{k-1}\sum_{i=1}^{k} r_j \log\left(1 + \exp(b_j + d_i)\right).$$

The method consists in computing $d_i^*$ and $b_j^*$ from the implicit
equations (21) and (22). It should be noted that each of the
equations (21) involves only one item estimate, even though it
does depend on all (k − 1) ability estimates $b_j^*$. Similarly, each
equation in (22) involves only one ability estimate and all the
item estimates $d_i^*$. We handle these equations as two independent
sets, and solve them accordingly.

When the item estimates are assumed known, (22) is the set
of equations used for person measurement. From (22) we can
obtain a scoring table, a table which will show the estimated ability
corresponding to every score, for a given set of items. This scoring
table involves only the item estimates. Therefore, a scoring table
can be provided for any specific test, and the ability of an
individual can be estimated by looking up his score in the scoring
table. Once the scoring table is obtained no further computations
are necessary. Thus computations are in general only necessary
at the item calibration stage. They become necessary at the
measurement stage only if one does not want to use a set of items for
which a scoring table has been provided.

The approximation of a standard error for item estimates can
be approached in two ways. In equation (21) we can assume that
the variance of the item estimate is due primarily to the
uncertainty in the item score $a_{+i}$. To a first approximation this gives:

$$V(d_i^*) \approx (\partial d_i / \partial a_{+i})^2 \, V(a_{+i})$$

which from (21) leads to:

$$V(d_i^*) \approx 1 \Big/ \sum_{j=1}^{k-1} r_j \frac{\exp(b_j^* + d_i^*)}{(1 + \exp(b_j^* + d_i^*))^2}. \tag{23}$$


An alternative is to approximate the standard error from the 
asymptotic value of the variance of a maximum likelihood estimate. 
But this leads to the same equation (23). 

To obtain estimates for the item parameters, we have to solve
the two sets of equations (21) and (22). Since these equations are
implicit in $d_i^*$ and $b_j^*$, we cannot solve them directly. In our
analysis we use the Newton-Raphson procedure to solve for the
unknown parameter estimates. This procedure is an iterative one. We
start with an initial estimate $x_0$, and using the Newton-Raphson
equation obtain an improved estimate $x_1$. Now using the new value
$x_1$ as the starting estimate, we repeat the procedure until the
estimates do not change appreciably. If $f(x) = 0$ is the implicit
equation to be solved for $x$, the value of $x$ at the $(n + 1)$th iteration is
given by:

$$x_{n+1} = x_n - \left( f(x)/f'(x) \right)_{x = x_n} \tag{24}$$

where $x_n$ = value of $x$ at the $n$th iteration,
$f'(x) = df(x)/dx$, the derivative of $f(x)$ with respect to $x$,
and $f(x)/f'(x)$ is evaluated at $x = x_n$.


Equation (24) is suitable for equations which are functions of
only one unknown. This is adequate for our purposes because we
can solve (21) and (22) as two independent sets of equations, in
which each of the $k$ equations in (21) and each of the $(k - 1)$
equations in (22) are locally functions of only one unknown.

To facilitate a description of the procedure we write equations
(21) and (22) in a form analogous to equation (24).

If

$$f(d_i^*) = a_{+i} - \sum_{j=1}^{k-1} r_j \frac{\exp(d_i^* + b_j^*)}{1 + \exp(d_i^* + b_j^*)}, \qquad i = 1, 2, \ldots, k,$$

then

$$d_{i,n+1}^* = d_{i,n}^* - \left( f(d_i^*)/f'(d_i^*) \right)_{d_i^* = d_{i,n}^*} \tag{25}$$

$$f'(d_i^*) = -\sum_{j=1}^{k-1} r_j \frac{\exp(d_i^* + b_j^*)}{(1 + \exp(d_i^* + b_j^*))^2}. \tag{26}$$

Also if

$$g(b_j^*) = j - \sum_{i=1}^{k} \frac{\exp(b_j^* + d_i^*)}{1 + \exp(b_j^* + d_i^*)}, \qquad j = 1, 2, \ldots, k - 1,$$

then

$$b_{j,n+1}^* = b_{j,n}^* - \left( g(b_j^*)/g'(b_j^*) \right)_{b_j^* = b_{j,n}^*} \tag{27}$$

$$g'(b_j^*) = -\sum_{i=1}^{k} \frac{\exp(b_j^* + d_i^*)}{(1 + \exp(b_j^* + d_i^*))^2}. \tag{28}$$


Since the method is iterative, we need some basis for termination.
We employ two different criteria for judging whether convergence
has been reached. An obvious consideration is to look at the
average squared difference SD between the values of estimates obtained
from two consecutive iterations. If SD is less than some criterion
value SC, we stop the procedure, because insufficient improvement
is obtained in the estimates by continuing the procedure further.
An alternate criterion is to monitor the value of the likelihood
function. This can be accomplished by computing the likelihood at
each iteration and observing the rate of increase. If things are as
they should be, the likelihood will increase rapidly at first, and
then become approximately constant. The procedure can be stopped
when the increase in the likelihood is less than some specified value.


Procedure 


The first part of the procedure for MAX is the same as that
described for LOG. The data is edited in exactly the same way
and the LOG procedure followed until initial item estimates are
obtained. These item estimates are then used as the initial values
for the iterative procedure described in MAX. The initial values
for the ability estimates are taken to be zero.

Using the LOG item estimates and zero ability estimates as 
starting values, the iterative procedure, described by the Newton- 
Raphson equations (25) and (27), is continued until stable esti- 
mates are obtained both for the item and the ability estimates. 

This is accomplished by solving (25) for the item estimates as- 
suming that the abilities are zero. The obtained item estimates are 
substituted in (27) and these equations are solved for improved 
ability estimates. The improved ability estimates are then substi- 
tuted in (25) and improved item estimates obtained. This proced- 
ure of alternately solving (25) and (27) using improved estimates 
at each stage is continued till the process converges. 
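
As an illustration only, and not the FORTRAN used in our program, the alternating solution of (25) and (27) might be sketched in Python as follows. The names `solve_one` and `calibrate`, the list-based data layout, and the guard against a zero estimate in the stopping test are assumptions introduced for this sketch; only the SD criterion is checked here, not the change in likelihood.

```python
import math


def solve_one(c, a, y, x, sc=1e-5, max_iter=50):
    """Newton-Raphson root of c - sum(a_i * logistic(x + y_i)) = 0 in x."""
    for _ in range(max_iter):
        f = -c
        fp = 0.0
        for a_i, y_i in zip(a, y):
            p = 1.0 / (1.0 + math.exp(-(x + y_i)))
            f += a_i * p
            fp += a_i * p * (1.0 - p)
        step = -f / fp
        x += step
        # squared relative change; the floor on x*x is an added safeguard
        if step * step <= sc * max(x * x, 1e-12):
            break
    return x


def calibrate(item_scores, r, d0, sc=1e-5, max_cycles=20):
    """Alternate between item estimates (25) and ability estimates (27)."""
    k = len(d0)
    d = list(d0)
    b = [0.0] * (k - 1)          # abilities start at zero
    ones = [1.0] * k
    for _ in range(max_cycles):
        old = d + b
        d = [solve_one(item_scores[i], r, b, d[i], sc) for i in range(k)]
        mean_d = sum(d) / k
        d = [x - mean_d for x in d]   # constrain item estimates to sum to zero
        b = [solve_one(float(j + 1), ones, d, b[j], sc) for j in range(k - 1)]
        sd = sum((u - v) ** 2 for u, v in zip(d + b, old)) / (2 * k - 1)
        if sd <= sc:
            break
    return d, b
```

For three equally easy items with ten persons in each of the score groups 1 and 2, every item score is 10, and the sketch should return item estimates near zero and ability estimates near $\mp \log 2$.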

Two criteria for convergence were described in the previous sec- 
tion. We use both criteria. First we examine the average squared 
deviation SD and then test the change in the likelihood ELD. If 
either SD or ELD is less than the specified criterion value we stop 
the procedure. The criterion values we use are 10-5 for SD and 10? 
for ELD. We find that these cut-off values ensure sufficient con- 
vergence. When the procedure is continued further no appreciable 
change is observed in the estimates. The FORTRAN programming 
steps required for implementing the successive solutions for (25) 
and (27) are shown below: 


C     EL STARTS AT A LARGE NEGATIVE VALUE
      EL = -10.E+6
      DO 1 I = 1, 20
      ELD = EL
      CALL MAXLIK (D, B, AP, SD, K, NGK, R, SC)
      CALL LIKE (D, B, EL, K, NGK, R, IA)
      ELD = EL - ELD
C     STOP WHEN EITHER CRITERION IS MET
      IF (SD - SC) 2, 2, 3
    3 IF (ELD - CM) 2, 2, 1
    1 CONTINUE
    2 CONTINUE


The log likelihood EL is initialized at a negative value since it is
expected to increase. This is necessary in order to compute
the change in the likelihood for the first iteration. The vector B
of ability estimates is initially set to zero, and the vector D of item
estimates is that obtained from the LOG method. From our
experience we find that the maximum number of times we might
expect to go through this procedure is less than 20; therefore we set
the maximum index of the loop at 20. SC and CM are the criterion
values discussed above, e.g. SC = 10% and CM = 10° and 


K = number of items 

NGK = K — 1, the number of score groups 
R = vector of score group sizes 

IA = data matrix in fixed point mode. 


AP is the vector of item scores which can be computed from the
data matrix as follows:

$$AP_i = \sum_{j=1}^{k-1} a_{ji}.$$

MAXLIK and LIKE are subroutines. MAXLIK performs the
iterations for the individual sets of equations, i.e. for (25) and (27).
LIKE computes the likelihood. The steps required for these
subroutines are indicated below.

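      SUBROUTINE MAXLIK (D, B, AP, SD, K, NGK, R, SC)
      SD = 0
      SP = 0
C     UPDATE EACH ITEM ESTIMATE FROM EQUATION (25)
      DO 1 I = 1, K
      PD = D(I)
      CALL NEWT (NGK, AP(I), R, B, D(I), SC)
      SD = SD + (D(I) - PD)**2
    1 SP = SP + D(I)
C     CENTER THE ITEM ESTIMATES SO THAT THEY SUM TO ZERO
      SP = SP/FLOATF(K)
      DO 2 I = 1, K
      D(I) = D(I) - SP
    2 Z(I) = 1.
C     UPDATE EACH ABILITY ESTIMATE FROM EQUATION (27)
      DO 3 J = 1, NGK
      FJ = J
      PB = B(J)
      CALL NEWT (K, FJ, Z, D, B(J), SC)
    3 SD = SD + (B(J) - PB)**2
      SD = SD/FLOATF(K + NGK)
      RETURN
      END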


It should be noted that, as in the LOG method, here also the item
estimates are constrained so that they add to zero, i.e.
$\sum_{i=1}^{k} d_i^* = 0$.

The iterations for the Newton-Raphson method are performed
in subroutine NEWT. It is a general subroutine and is applicable
to any equation of the form:

$$C - \sum_{i=1}^{N} A_i \frac{\exp(X + Y_i)}{1 + \exp(X + Y_i)} = 0$$

where $X$ = the unknown,
$C$, and vectors $A$ and $Y$ are given constants.

The steps required for the programming are shown below: 


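      SUBROUTINE NEWT (N, C, A, Y, X, SC)
      DO 1 IT = 1, 50
      F = -C
      FP = 0
      DO 2 I = 1, N
      P = EXPF (X + Y(I))
      PP = 1. + P
      F = F + A(I)*P/PP
    2 FP = FP + A(I)*P/(PP*PP)
C     F NOW HOLDS THE NEWTON-RAPHSON CORRECTION TO X
      F = -F/FP
      X = X + F
C     STOP WHEN THE SQUARED RELATIVE CHANGE IS BELOW SC
      IF ((F/X)**2 - SC) 3, 3, 1
    1 CONTINUE
    3 CONTINUE
      RETURN
      END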
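
For readers without access to a FORTRAN II compiler, a rough Python counterpart of NEWT is sketched below. The function name `newt`, the default arguments, and the floor placed on `x * x` in the stopping test (so that an estimate of exactly zero does not cause a division by zero) are assumptions of this sketch rather than features of the original subroutine.

```python
import math


def newt(c, a, y, x, sc=1e-5, max_iter=50):
    """Solve c - sum(a[i] * exp(x + y[i]) / (1 + exp(x + y[i]))) = 0 for x."""
    for _ in range(max_iter):
        f = -c
        fp = 0.0
        for a_i, y_i in zip(a, y):
            p = 1.0 / (1.0 + math.exp(-(x + y_i)))  # logistic probability
            f += a_i * p                # accumulates -(left-hand side)
            fp += a_i * p * (1.0 - p)   # accumulates minus its derivative
        step = -f / fp                  # Newton-Raphson correction
        x += step
        # same test as (F/X)**2 <= SC, rearranged to avoid dividing by x
        if step * step <= sc * max(x * x, 1e-12):
            break
    return x
```

For example, `newt(1.0, [2.0], [0.5], 0.0)` solves $1 - 2\exp(x + 0.5)/(1 + \exp(x + 0.5)) = 0$, whose root is $x = -0.5$.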


Finally Subroutine LIKE is given below: 


      SUBROUTINE LIKE (D, B, EL, K, NGK, R, IA)
      EL = 0
      DO 2 I = 1, K
      DO 2 J = 1, NGK
C     SKIP EMPTY SCORE GROUPS
      IF (R(J)) 2, 2, 3
    3 T = B(J) + D(I)
      EL = EL + T*FLOATF(IA(J, I)) - R(J)*LOGF(1. +
     1 EXPF(T))
    2 CONTINUE
      RETURN
      END
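
The same computation can be written compactly in Python. This is a sketch under assumed conventions: `ia[j][i]` holds the number of persons in score group `j` (index 0 for score 1) who answered item `i` correctly, and the function name `log_likelihood` is introduced here for illustration.

```python
import math


def log_likelihood(d, b, r, ia):
    """Log likelihood summed over items i and occupied score groups j."""
    el = 0.0
    for i, d_i in enumerate(d):
        for j, b_j in enumerate(b):
            if r[j] <= 0:          # skip empty score groups, as in LIKE
                continue
            t = b_j + d_i
            el += t * ia[j][i] - r[j] * math.log1p(math.exp(t))
    return el
```

With two items of zero easiness, one score group of ten persons at zero ability, and five correct answers per item, the value is $-20 \log 2$.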


Once the item and ability estimates have been obtained, by the 
procedure described above, the standard error of item estimates 
is easily computed from equation (23). The vector SI of standard 
errors of the item estimates depends mainly upon the number of 
persons in the sample, i.e. the vector R of score group sizes. The 
larger the elements of this vector R, the smaller will be the
standard errors. The program segment for computing SI is shown below:


      DO 1 I = 1, K
      SI(I) = 0
      DO 2 J = 1, NGK
      P = EXPF (B(J) + D(I))
      PP = 1. + P
    2 SI(I) = SI(I) + R(J)*P/(PP*PP)
    1 SI(I) = SQRTF (1./SI(I))
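
A small Python sketch of equation (23) makes the dependence on sample size concrete. The function name `item_standard_errors` and the list arguments are assumptions made for this illustration.

```python
import math


def item_standard_errors(d, b, r):
    """Standard error of each item estimate from equation (23)."""
    se = []
    for d_i in d:
        info = 0.0
        for b_j, r_j in zip(b, r):
            p = 1.0 / (1.0 + math.exp(-(b_j + d_i)))
            info += r_j * p * (1.0 - p)   # r_j * exp(t) / (1 + exp(t))**2
        se.append(math.sqrt(1.0 / info))
    return se
```

With a single score group at zero ability and an item of zero easiness, 100 persons give a standard error of 0.2, and 400 persons give 0.1: quadrupling the sample halves the standard error.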


Methods 


C. Person Measurement 
1. Ability Estimation: 


This part of the procedure is especially important for test users.
Ordinarily test users are not concerned with calibrating items.
Given a pool of calibrated items, however, they want to estimate
abilities for persons to whom sets of items have been administered.

As mentioned earlier, if a scoring table is provided with the items
and all the items used to compute the scoring table are used in the
test, there is no need to compute new ability estimates. They can
be obtained immediately by referring to the scoring table. If only
some of the items are used, however, one needs to compute the
abilities and their standard errors for scores on this selection of
items. That procedure is given in this section.

The equations to be solved have been discussed previously (22).
The only way to solve these implicit equations (22) is by means of
an iterative method. The Newton-Raphson procedure gives the
relationship between two successive values of the estimates in terms
of the functional form of the equation to be solved. This procedure
was discussed previously (27), but we will restate the equations for
the convenience of those interested in ability estimation only.

$$j = \sum_{i=1}^{k} \frac{\exp(d_i + b_j^*)}{1 + \exp(d_i + b_j^*)}, \qquad j = 1, 2, \ldots, k - 1$$

$j$ = the score; an estimated ability $b_j^*$ is associated with each score,
$d_i$ = the item estimates, assumed known from the calibration of the item pool,
$k$ = number of items used for the test.

$$b_{n+1}^* = b_n^* - \left( g(b^*)/g'(b^*) \right)$$

$$g(b^*) = j - \sum_{i=1}^{k} \frac{\exp(b^* + d_i)}{1 + \exp(b^* + d_i)}$$

$$g'(b^*) = -\sum_{i=1}^{k} \frac{\exp(b^* + d_i)}{(1 + \exp(b^* + d_i))^2}$$

$b_n^*$ = value of the estimate at the $n$th iteration,
$b_{n+1}^*$ = value of the estimate at the $(n + 1)$th iteration,
$g(b^*)/g'(b^*)$ is evaluated at $b^* = b_n^*$.


Since we are solving the equations by means of an iterative
method, we need some criterion for terminating the procedure. We
stop the iterations when SD, the square of the relative change in
the estimate, is less than some specified value SC. We find that no
appreciable change is observed in the estimates if the procedure is
carried on beyond the point when SD becomes less than $10^{-5}$.
Therefore, we set SC = $10^{-5}$.


The FORTRAN program segment for this procedure is given below:


      DO 1 J = 1, NGK
      DO 2 IT = 1, 50
      G = -J
      GP = 0
      DO 3 I = 1, K
      P = EXPF (D(I) + B(J))
      PP = 1. + P
      G = G + P/PP
    3 GP = GP + P/(PP*PP)
C     NEWTON-RAPHSON CORRECTION, AS IN SUBROUTINE NEWT
      G = -G/GP
      B(J) = B(J) + G
      SD = G/B(J)
      IF (SD**2 - SC) 1, 1, 2
    2 CONTINUE
    1 CONTINUE


Thus we obtain an ability estimate for each of the $k - 1$ scores 1,
2, ..., $k - 1$. One advantage of using this metric for the abilities
instead of the observed score is that the scale of this metric is an
interval scale, whereas in general the raw score scale is not.
Another important consideration is that abilities in this metric,
obtained from different sets of calibrated items, are comparable. In
the case of the raw score there is no rigorous method of putting the
score on a common scale.
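
To make the idea of a scoring table concrete, the following Python sketch solves the restated equations for every score from 1 to $k - 1$ and returns the abilities keyed by score. The function name `scoring_table`, the starting value of zero for each ability, and the floor on `b * b` in the stopping test are assumptions of this sketch.

```python
import math


def scoring_table(d, sc=1e-5, max_iter=50):
    """Map each raw score 1..k-1 to an ability estimate, given item estimates d."""
    k = len(d)
    table = {}
    for score in range(1, k):
        b = 0.0
        for _ in range(max_iter):
            g = -float(score)
            gp = 0.0
            for d_i in d:
                p = 1.0 / (1.0 + math.exp(-(d_i + b)))
                g += p
                gp += p * (1.0 - p)
            step = -g / gp
            b += step
            # squared relative change, guarded against b == 0
            if step * step <= sc * max(b * b, 1e-12):
                break
        table[score] = b
    return table
```

For four items of equal easiness zero, the table gives $-\log 3$, $0$, and $\log 3$ for scores 1, 2, and 3, since each score $j$ then corresponds to a success probability of $j/4$ on every item.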


2. Standard Error of Ability Estimate: 


The accuracy of any ability measurement is an important
consideration. Not only do we want to be able to measure the ability
of a person, but we would also like to know how well we have been
able to make the measurement. The major contribution to the error
variance of the ability estimate comes from the variance in score
produced by a given individual. As we shall later see, this part of
the error variance depends upon the number of items and their
easiness range. Therefore, in designing a measurement, for example
constructing a test, it will be the accuracy desired which will
determine the number and easiness range of the items selected for the
ability estimation.

A smaller number of items is needed to produce a given level of
precision in the measurement when the difficulty level of the items
is approximately equal to the ability of the person being measured.
This is similar to choosing items at the fifty per cent level of
difficulty in classical item analysis. For a given set of $k$ items the
standard errors of the ability estimates corresponding to raw scores
around $k/2$ will be smaller than the standard errors for the more
extreme scores near 1 and $k - 1$. Hence, by choosing items with the
appropriate difficulties it is possible to economise on the number of
items administered.

Another component which makes a small contribution to the 
variance of ability estimates comes from the imprecision in item 
calibration. This effect can be made negligible by calibrating the 
items on large samples so that the standard errors of item estimates 
are very small. 

An approximation of the variance of the ability estimate $b^*$ is
given by:

$$V(b^*) = \frac{1}{C(b^*) \exp(b^*)} + \frac{1}{C^2(b^*)} \sum_{i=1}^{k} V(d_i) \left( \frac{\exp(d_i)}{(1 + \exp(b^* + d_i))^2} \right)^2 \tag{29}$$

where

$$C(b^*) = \sum_{i=1}^{k} \frac{\exp(d_i)}{(1 + \exp(b^* + d_i))^2}$$

and $V(d_i)$ is the variance of the item calibration $d_i$.

The first term in the right hand side of the expression (29) is
due to the variance in the score and the second term is due to the
imprecision of item calibration. The first term is always larger than
the second. For example, if we assume that all $V(d_i)$ are one
(usually $V(d_i)$ is much less than one) the second term is $p(1 - p)$ times
the first. We know that the maximum value of $p(1 - p)$ is 0.25;
therefore, the second term will, at the most, contribute one fourth
as much variance as that due to the uncertainty in the score, in
other words, at most 20 per cent of the total error variance. The
magnitude of the first term depends primarily on the number of
items, and to a lesser degree on the relationship between their
easiness range and the ability being measured.

Given ability estimates, item estimates and their variances we 
can compute the standard errors of the ability estimates by means 
of the following FORTRAN program segment: 


      DO 1 J = 1, NGK
      V = 0
      C = 0
      DO 2 I = 1, K
      Y(I) = EXPF(D(I))/(1. + EXPF(D(I) + B(J)))**2
    2 C = C + Y(I)
      DO 3 I = 1, K
    3 V = V + Y(I)*Y(I)*SI(I)*SI(I)
    1 SA(J) = SQRTF (1./(C*EXPF(B(J))) + V/C**2)


SA = vector of standard errors of ability estimates
K = number of items
B = vector of ability estimates
D = vector of item estimates
SI = vector of standard errors of item estimates.
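
The two terms of (29) can also be written in terms of $p_i(1 - p_i)$, since $\exp(b^*) \exp(d_i)/(1 + \exp(b^* + d_i))^2 = p_i(1 - p_i)$. The Python sketch below uses that algebraically equivalent form; the function name `ability_standard_error` and the optional `si` argument (item standard errors, taken as zero when omitted) are assumptions introduced for illustration.

```python
import math


def ability_standard_error(b, d, si=None):
    """Standard error of an ability estimate b, following equation (29)."""
    if si is None:
        si = [0.0] * len(d)          # treat the item calibrations as exact
    pq = []
    for d_i in d:
        p = 1.0 / (1.0 + math.exp(-(b + d_i)))
        pq.append(p * (1.0 - p))
    info = sum(pq)                   # equals C(b) * exp(b)
    score_term = 1.0 / info
    calib_term = sum((s * w) ** 2 for s, w in zip(si, pq)) / info ** 2
    return math.sqrt(score_term + calib_term)
```

With four items of easiness zero and an ability of zero, the first term is 1 and, if every item standard error is set to one, the second term is 0.25, which is 20 per cent of the total of 1.25 and so matches the bound stated above.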


D. Testing the Fit of the Item: 


During item calibration it is necessary to decide whether all the 
items that have been tried are to be retained for the final pool. We 
need a statistical criterion for deciding whether an item is good 
enough from the point of view of the model. 

To make this decision we need to investigate how the elements
$a_{ji}$ in the data matrix A depend upon the estimates $d_i^*$ and $b_j^*$. If
we can derive the expectation $E(a_{ji})$ of these elements in terms of
the obtained estimates we can form a standard deviate

$$y_{ji} = (a_{ji} - E(a_{ji})) / (V(a_{ji}))^{1/2} \tag{30}$$

and use this deviate as the basis for a test of item fit. If item $i$ fits
the model, and the score group $r_j$ is large enough, then $y_{ji}$ will have
an approximately unit normal distribution.

Now $a_{ji}$ has a binomial distribution with parameters $p_{ji}$, the
probability of making a correct response, and $r_j$, the number of persons
with a score $j$. Therefore, the expectation of $a_{ji}$ is given by:

$$E(a_{ji}) = r_j p_{ji} = r_j \exp(b_j + d_i) / (1 + \exp(b_j + d_i)) \tag{31}$$

and its variance by

$$V(a_{ji}) = r_j p_{ji} (1 - p_{ji}).$$

Since $b_j$ and $d_i$ are not known we use their estimates and
approximate the expectation and variance of $a_{ji}$ as

$$E(a_{ji}) \approx r_j p_{ji}^* = r_j \exp(b_j^* + d_i^*) / (1 + \exp(b_j^* + d_i^*))$$

and

$$V(a_{ji}) \approx r_j p_{ji}^* (1 - p_{ji}^*).$$


Examination of the matrix Y, with the standard deviates $y_{ji}$ as
elements, will show us how well the items fit, and indicate where
there are signs of misfit.

From the matrix Y we can obtain statistics which will enable us
to evaluate the fit of the model to the data as a whole, and we can
also form approximate statistics which will help identify items
which are bad, and hence need to be reconsidered. As discussed in
the introduction, an item may not fit for a number of reasons. It
may be badly constructed or incorrectly scored. Its discrimination
may be very different from the discriminations of the other items.
It could be measuring some ability other than that being measured
by the rest of the items. In any case, the item will be detected so
that it can be examined for deletion or revision.

The over-all statistic used in the procedure is a chi-square
statistic $\chi^2$ which is obtained by summing the squared unit normal
deviates over the entire matrix Y

$$\chi^2 = \sum_{i=1}^{k} \sum_{j=1}^{k-1} y_{ji}^2 \tag{32}$$

with degrees of freedom = $(k - 1)(m - 1)$,
where $m$ = number of score groups with $r_j \neq 0$.

The degrees of freedom are obtained from the number of
observations in the data matrix, taking account of the loss of degrees
of freedom due to constraints and parameter estimation. There are
$k \times m$ observations in the data matrix. There are $m$ constraints on
the score margins since $\sum_i a_{ji} = j r_j$. Finally $(k - 1)$ item
parameters have been estimated. Therefore the degrees of freedom for $\chi^2$
are:

$$\text{d.f.} = km - m - (k - 1) = (m - 1)(k - 1). \tag{33}$$

An approximate $\chi^2$ statistic can also be obtained for each item
by summing $y_{ji}^2$ over the score groups to give

$$\chi_i^2 = \sum_{j=1}^{k-1} y_{ji}^2 \tag{34}$$

with d.f. = $m - 1$.


Since (34) is an approximate $\chi^2$, we do not think it advisable to
mechanically delete all items for which the $\chi^2$ is significant at some
level. We prefer to examine in detail items for which $\chi_i^2$ is large.
This may mean evaluating the possible effects of discrimination
and guessing in these "bad" items. Then when we have decided
which of the "bad" items to delete, we rerun the analysis to see how
the remaining set of items look.

A FORTRAN program segment which will implement the pro- 
cedure in this section is given below: 


      NGK = K - 1
      CH = 0
      DO 1 I = 1, K
      DO 2 J = 1, NGK
      IF (R(J)) 4, 4, 3
    4 Y(J, I) = 0
      GO TO 2
C     STANDARD DEVIATE (30) WRITTEN IN TERMS OF T = EXP(D + B)
    3 T = EXPF(D(I) + B(J))
      Y(J, I) = ((1. + T)*FLOATF(IA(J, I)) - R(J)*T)/
     1 SQRTF(R(J)*T)
    2 CONTINUE
    1 CONTINUE
      DO 5 I = 1, K
      CHI(I) = 0
      DO 6 J = 1, NGK
    6 CHI(I) = CHI(I) + Y(J, I)**2
      CH = CH + CHI(I)
    5 CHI(I) = CHI(I)/FLOATF(M - 1)
      CH = CH/FLOATF((M - 1)*NGK)


CH = mean square for the entire data,
CHI = vector of item mean squares,
R = vector of score group sizes,
M = number of occupied score groups with $r_j \neq 0$,
IA = data matrix,
K = number of items,
D = vector of item estimates,
B = vector of ability estimates.
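
The fit statistics can be sketched in Python directly from the expectation and variance of $a_{ji}$, which is algebraically the same as the form in $T$ used above. The function name `fit_mean_squares` and the data layout (`ia[j][i]` for score group `j` and item `i`) are assumptions of this sketch, and it presumes at least two occupied score groups.

```python
import math


def fit_mean_squares(d, b, r, ia):
    """Return (overall mean square, list of item mean squares)."""
    occupied = [j for j, r_j in enumerate(r) if r_j > 0]
    m = len(occupied)                 # number of occupied score groups
    k = len(d)
    chi_items = []
    for i, d_i in enumerate(d):
        chi = 0.0
        for j in occupied:
            p = 1.0 / (1.0 + math.exp(-(b[j] + d_i)))
            expected = r[j] * p
            variance = r[j] * p * (1.0 - p)
            chi += (ia[j][i] - expected) ** 2 / variance   # y_ji squared
        chi_items.append(chi)
    total_ms = sum(chi_items) / ((m - 1) * (k - 1))
    item_ms = [c / (m - 1) for c in chi_items]
    return total_ms, item_ms
```

As a purely arithmetic check, take three items and two score groups of ten persons with all estimates zero: each expected count is 5 with variance 2.5, so a single observed count of 7 contributes $4/2.5 = 1.6$ to its item and gives an overall mean square of $1.6/2 = 0.8$.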


A (FORTRAN II) PROGRAM FOR SAMPLE-FREE 
ITEM ANALYSIS 


This program estimates item and ability parameters from item
analysis data according to the logistic response model:

$$\Pr(a_{vi}) = (Z_v E_i)^{a_{vi}} / (1 + Z_v E_i) = \exp(a_{vi}(b_v + d_i)) / (1 + \exp(b_v + d_i))$$

1. Program language is FORTRAN II.
   The program operates under the FASTRAN compiler of the
   University of Chicago IBM 7094/7040 System. It may be
   necessary to modify input, output, clock, and date routines to make
   it operate elsewhere.

2. Program capacity is 150 items and 100,000 persons.

3. Input is a sample of person response vectors either punched
   1 = right and 0 = wrong or read with a scoring key in response
   vector format.

4. Output

   the test:   Kuder-Richardson Formula 20 test reliability

   each item:  item score: per cent persons passing item
               log easiness: ability intercept of item characteristic
                 curve at median response
               discrimination: slope of item characteristic curve
                 at median response
               reliability: point biserial correlation between
                 item response and estimated ability
               fit to model: probability of observed responses
                 if item fits model
               standard errors of log easiness and discrimination

   each score: sample frequency at that score
               sample percentile through that score
               log ability estimated on an interval scale
               standard error of log ability estimate
               raw ability estimated on a ratio scale
               confidence boundaries for raw ability


Compilation listings and source decks are available at cost from:
Benjamin Wright, University of Chicago, 5835 Kimbark Avenue,
Chicago, Ill. 60637.
AN ITEM SAMPLING MODEL FOR THE RELIABILITY OF
COMPOSITE TESTS 


DONALD W. ZIMMERMAN 


Carleton University 
Ottawa, Canada 


This paper considers three quantities which are lower bounds on
the reliability coefficient: the KR20 formula, the KR21 formula
(Kuder and Richardson, 1937), and Guttman's $\lambda_1$ (Guttman, 1945).
Its purpose is to establish certain unifying relationships among these 
three quantities and some other results which follow from a model 
for the reliability of tests consisting of dichotomous items. 

First, a formulation of reliability consistent with the classical 
test theory model as presented by Lord and Novick (1968) will be 
given. A model for the sampling of dichotomous test items will be re- 
lated to that formulation. Then, necessary and sufficient conditions 
under which the above quantities are equal to test reliability will be 
established and interpreted in the context of the model. These three 
reliability formulas turn out to be of special interest when considered 
together because of a symmetry underlying the conditions for their 
validity. 


Notation, Definitions, and Preliminary Theorems 


The subscript $i$, taking on values from 1 to $N$, will refer to items.
The subscript $j$, taking on values from 1 to $K$, will refer to persons.
The subscript $g$, taking on values from 1 to $H$, will refer to repeated
measurements on a given person. Expectations, variances, and
covariances will be denoted by $E$, $\sigma^2$, and COV, with subscripts
indicating the variables over which these values are taken.

We consider a set of pairs of independent repeated measurements
having the same distribution function. One member of each pair is
denoted by $X_{gij}$, the other by $X_{gij}'$. Total test scores are $X_{gj}$ and $X_{gj}'$.
The quantity $X_{gij}$ is considered to be a random variable having
expectation with respect to the distribution over repeated
measurements $E_g X_{gij}$ and variance $\sigma^2_g X_{gij}$. Also, $E_g X_{gij}'$ and $\sigma^2_g X_{gij}'$ are
defined in the same way. The quantities $E_g X_{gj}$, $E_g X_{gj}'$, $\sigma^2_g X_{gj}$, and $\sigma^2_g X_{gj}'$
are the expectations and variances with respect to the distribution
over repeated measurements of total test scores for person $j$.
Assume that these measurements are unit-dichotomous items and
that the random variable $X_{gij}$ takes on the value 1 with probability
$p_{ij}$ and the value 0 with probability $1 - p_{ij}$. Then, $E_g X_{gij} = p_{ij}$.

A possible interpretation of the quantity $p_{ij}$ is the proportion of
items in a population, or pool, known by person $j$, and a possible
interpretation of the set of independent repeated measurements is the
repeated random selection of an item from that population. It should
be noted that under this interpretation $NK$ distinct pools are
considered, each having a value $p_{ij}$ on which no restrictions are placed
other than $0 < p_{ij} < 1$.

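Total observed variance is

$$\sigma^2_{gj} X_{gj} = E_{gj} X_{gj}^2 - (E_{gj} X_{gj})^2,$$

where the symbol $E_{gj}$ refers to the expectation over persons and
repeated measurements. Similar expressions can be written for $\sigma^2_{gj} X_{gj}'$
and also for observed item variances, $\sigma^2_{gj} X_{gij}$ and $\sigma^2_{gj} X_{gij}'$.

Total observed covariance between pairs of repeated
measurements is

$$\mathrm{COV}_{gj}(X_{gj}, X_{gj}') = E_{gj} X_{gj} X_{gj}' - (E_{gj} X_{gj})(E_{gj} X_{gj}').$$

The reliability of a test is defined as follows:

$$\rho = \frac{\mathrm{COV}_{gj}(X_{gj}, X_{gj}')}{\sigma_{gj} X_{gj} \, \sigma_{gj} X_{gj}'}.$$

The reliability of an item, $\rho_i$, is defined in the same way, in terms of
$X_{gij}$ and $X_{gij}'$.

Since the distribution functions of $X_{gj}$ and $X_{gj}'$ are the same for
any person $j$, $E_g X_{gj} = E_g X_{gj}'$ and $\sigma^2_g X_{gj} = \sigma^2_g X_{gj}'$. It is not assumed
that these distribution functions are the same for different persons.
The same concept applies to the item scores, $X_{gij}$ and $X_{gij}'$.

It follows from the above that $\sigma^2_{gj} X_{gj} = \sigma^2_{gj} X_{gj}'$, $\sigma^2_j E_g X_{gj} = \sigma^2_j E_g X_{gj}'$,
and $\rho(E_g X_{gj}, E_g X_{gj}') = 1$, where $\rho(\cdot, \cdot)$ denotes the product moment
correlation between $E_g X_{gj}$ and $E_g X_{gj}'$. The same relationships hold
for item scores.

Since repeated measurements are independent, $E_g X_{gj} X_{gj}' = E_g X_{gj} \, E_g X_{gj}'$
and

$$\mathrm{COV}_{gj}(X_{gj}, X_{gj}') = E_j(E_g X_{gj} \, E_g X_{gj}') - E_j E_g X_{gj} \, E_j E_g X_{gj}' = \mathrm{COV}_j(E_g X_{gj}, E_g X_{gj}') = \sigma^2_j E_g X_{gj}.$$

Also,

$$\sigma^2_{gj} X_{gj} = E_j E_g X_{gj}^2 - (E_j E_g X_{gj})^2,$$

$$\sigma^2_g X_{gj} = E_g X_{gj}^2 - (E_g X_{gj})^2,$$

$$E_j \sigma^2_g X_{gj} = E_j E_g X_{gj}^2 - E_j (E_g X_{gj})^2,$$

and

$$\sigma^2_j E_g X_{gj} = E_j (E_g X_{gj})^2 - (E_j E_g X_{gj})^2.$$

Therefore,

$$\sigma^2_{gj} X_{gj} = E_j \sigma^2_g X_{gj} + \sigma^2_j E_g X_{gj}.$$

Using these results together with the definition of reliability, it
follows that

$$\rho = \frac{\sigma^2_j E_g X_{gj}}{\sigma^2_{gj} X_{gj}} = 1 - \frac{E_j \sigma^2_g X_{gj}}{\sigma^2_{gj} X_{gj}}. \tag{1}$$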


An equation for the reliability of an item, $\rho_i$, can be derived in the
same way.

An alternative followed by Guttman (1945) uses (1) as a definition 
of test reliability. Then, the definition given in the present paper can 
be derived as a theorem. See also Zimmerman, Williams and Burk- 
heimer (1968). 

The following symbols will also be used:

$$\alpha = \frac{N}{N - 1} \left( 1 - \frac{\sum_{i=1}^{N} \sigma^2_{gj} X_{gij}}{\sigma^2_{gj} X_{gj}} \right) \quad \text{(same as KR20, also as Guttman's } \lambda_3\text{)}$$

$$\beta = \frac{N}{N - 1} \left( 1 - \frac{E_{gj} X_{gj} (N - E_{gj} X_{gj})}{N \sigma^2_{gj} X_{gj}} \right) \quad \text{(same as KR21)}$$

$$\lambda = 1 - \frac{\sum_{i=1}^{N} \sigma^2_{gj} X_{gij}}{\sigma^2_{gj} X_{gj}} \quad \text{(same as Guttman's } \lambda_1\text{)}$$

The derivation in this paper determines necessary and sufficient
conditions under which $\alpha$, $\beta$, and $\lambda$, respectively, are equal to $\rho$, using
the model described above. These quantities are lower bounds on $\rho$.
In the present context, however, each of these quantities is equal to
$\rho$ only when the $p_{ij}$ values of the $NK$ item populations are subject
to certain restrictions.


Derivation of Equations for p 


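The variance of total test scores for person $j$ with respect to the
distribution of scores over repeated measurements is

$$\sigma^2_g X_{gj} = N E_i E_g X_{gij} (1 - E_i E_g X_{gij}) - N \sigma^2_i E_g X_{gij} = N E_i E_g X_{gij} - N (E_i E_g X_{gij})^2 - N \sigma^2_i E_g X_{gij}. \tag{2}$$

The selection of $N$ items for a test from distinct item populations can
be regarded as a Poisson sequence of trials with $\bar{p} = E_i E_g X_{gij}$ and
$\sigma^2_p = \sigma^2_i E_g X_{gij}$, since the value of $p_{ij}$ changes from one item to
another. Equation (2), then, gives the variance of the total number
of "successes" in a sequence of $N$ trials, where the probability of
"success" is $p_{ij}$. The expectation over all $K$ persons of the variance
given by equation (2) is

$$E_j \sigma^2_g X_{gj} = N E_j E_i E_g X_{gij} - N E_j (E_i E_g X_{gij})^2 - N E_j \sigma^2_i E_g X_{gij}. \tag{3}$$

Now,

$$\sigma^2_j E_i E_g X_{gij} = E_j (E_i E_g X_{gij})^2 - (E_j E_i E_g X_{gij})^2,$$

and

$$\sigma^2_i E_j E_g X_{gij} = E_i (E_j E_g X_{gij})^2 - (E_i E_j E_g X_{gij})^2.$$

Since $(E_j E_i E_g X_{gij})^2 = (E_i E_j E_g X_{gij})^2$,

$$E_j (E_i E_g X_{gij})^2 = E_i (E_j E_g X_{gij})^2 - \sigma^2_i E_j E_g X_{gij} + \sigma^2_j E_i E_g X_{gij}. \tag{4}$$

Therefore,

$$E_j \sigma^2_g X_{gj} = N E_i E_j E_g X_{gij} - N E_i (E_j E_g X_{gij})^2 + N \sigma^2_i E_j E_g X_{gij} - N \sigma^2_j E_i E_g X_{gij} - N E_j \sigma^2_i E_g X_{gij}. \tag{5}$$

Substituting this result in (1) gives

$$\rho = 1 - \frac{N E_i E_j E_g X_{gij} - N E_i (E_j E_g X_{gij})^2}{\sigma^2_{gj} X_{gj}} + \frac{N E_j \sigma^2_i E_g X_{gij} - N \sigma^2_i E_j E_g X_{gij}}{\sigma^2_{gj} X_{gj}} + \frac{N \sigma^2_j E_i E_g X_{gij}}{\sigma^2_{gj} X_{gj}}. \tag{6}$$

Now,

$$\sum_{i=1}^{N} \sigma^2_{gj} X_{gij} = N E_i E_j E_g X_{gij} - N E_i (E_j E_g X_{gij})^2.$$

Therefore,

$$1 - \frac{\sum_{i=1}^{N} \sigma^2_{gj} X_{gij}}{\sigma^2_{gj} X_{gj}} = \rho - \frac{N E_j \sigma^2_i E_g X_{gij} - N \sigma^2_i E_j E_g X_{gij}}{\sigma^2_{gj} X_{gj}} - \frac{N \sigma^2_j E_i E_g X_{gij}}{\sigma^2_{gj} X_{gj}}. \tag{7}$$

Since $E_i E_g X_{gij} = E_g X_{gj}/N$,

$$\sigma^2_j E_i E_g X_{gij} = \frac{\sigma^2_j E_g X_{gj}}{N^2} = \frac{\rho \, \sigma^2_{gj} X_{gj}}{N^2}.$$

Therefore,

$$\frac{N \sigma^2_j E_i E_g X_{gij}}{\sigma^2_{gj} X_{gj}} = \frac{\rho}{N}.$$

Substituting $\rho/N$ in (7) and solving for $\rho$ gives

$$\rho = \frac{N}{N - 1} \left( 1 - \frac{\sum_{i=1}^{N} \sigma^2_{gj} X_{gij}}{\sigma^2_{gj} X_{gj}} + \frac{N E_j \sigma^2_i E_g X_{gij} - N \sigma^2_i E_j E_g X_{gij}}{\sigma^2_{gj} X_{gj}} \right). \tag{8}$$

Also, since

$$\sigma^2_i E_j E_g X_{gij} = E_i (E_j E_g X_{gij})^2 - (E_i E_j E_g X_{gij})^2$$

and

$$E_i (E_j E_g X_{gij})^2 = \sigma^2_i E_j E_g X_{gij} + (E_i E_j E_g X_{gij})^2,$$

it follows that

$$N E_i E_j E_g X_{gij} - N E_i (E_j E_g X_{gij})^2 = N E_i E_j E_g X_{gij} - N (E_i E_j E_g X_{gij})^2 - N \sigma^2_i E_j E_g X_{gij}$$

$$= E_{gj} X_{gj} - N \left( \frac{E_{gj} X_{gj}}{N} \right)^2 - N \sigma^2_i E_j E_g X_{gij} = \frac{E_{gj} X_{gj} (N - E_{gj} X_{gj})}{N} - N \sigma^2_i E_j E_g X_{gij}$$

and that

$$\rho = \frac{N}{N - 1} \left( 1 - \frac{E_{gj} X_{gj} (N - E_{gj} X_{gj})}{N \sigma^2_{gj} X_{gj}} + \frac{N E_j \sigma^2_i E_g X_{gij}}{\sigma^2_{gj} X_{gj}} \right). \tag{9}$$

Since

$$\sigma^2_{ij} E_g X_{gij} = E_i \sigma^2_j E_g X_{gij} + \sigma^2_i E_j E_g X_{gij} = E_j \sigma^2_i E_g X_{gij} + \sigma^2_j E_i E_g X_{gij},$$

$$N E_i \sigma^2_j E_g X_{gij} = N E_j \sigma^2_i E_g X_{gij} + N \sigma^2_j E_i E_g X_{gij} - N \sigma^2_i E_j E_g X_{gij}.$$

Substituting in (7) gives

$$\rho = 1 - \frac{\sum_{i=1}^{N} \sigma^2_{gj} X_{gij}}{\sigma^2_{gj} X_{gj}} + \frac{N E_i \sigma^2_j E_g X_{gij}}{\sigma^2_{gj} X_{gj}}. \tag{10}$$


Conditions under Which $\alpha$, $\beta$, and $\lambda$ Equal $\rho$
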
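From (8) it is evident that

$$\alpha = \rho \iff E_j \sigma^2_i E_g X_{gij} = \sigma^2_i E_j E_g X_{gij},$$

from (9) that

$$\beta = \rho \iff E_j \sigma^2_i E_g X_{gij} = 0,$$

and from (10) that

$$\lambda = \rho \iff E_i \sigma^2_j E_g X_{gij} = 0.$$

From the equations expressing $\sigma^2_{ij} E_g X_{gij}$ as sums of components it
follows that

$$E_j \sigma^2_i E_g X_{gij} = \sigma^2_i E_j E_g X_{gij} \iff E_i \sigma^2_j E_g X_{gij} = \sigma^2_j E_i E_g X_{gij}.$$

Also,

$$E_j \sigma^2_i E_g X_{gij} = 0 \implies \sigma^2_i E_j E_g X_{gij} = 0$$

and

$$E_i \sigma^2_j E_g X_{gij} = 0 \implies \sigma^2_j E_i E_g X_{gij} = 0.$$

Finally, since

$$\sigma^2_j E_i E_g X_{gij} = \frac{\rho \, \sigma^2_{gj} X_{gj}}{N^2},$$

$$\sigma^2_j E_i E_g X_{gij} = 0 \iff \rho = 0.$$

Therefore, to summarize,

$$\alpha = \rho \iff E_j \sigma^2_i E_g X_{gij} = \sigma^2_i E_j E_g X_{gij}$$

and

$$\alpha = \rho \iff E_i \sigma^2_j E_g X_{gij} = \sigma^2_j E_i E_g X_{gij},$$

$$\beta = \rho \iff E_j \sigma^2_i E_g X_{gij} = \sigma^2_i E_j E_g X_{gij} = 0,$$

and

$$\lambda = \rho \iff E_i \sigma^2_j E_g X_{gij} = \sigma^2_j E_i E_g X_{gij} = 0.$$

Hence,

$$\beta = \rho \implies \alpha = \rho$$

and

$$\lambda = \rho \implies \alpha = \rho.$$

However, $\alpha = \rho$ does not imply either that $\beta = \rho$ or that $\lambda = \rho$.
Further,

$$\lambda = \rho \implies \rho = 0.$$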


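Interpretation in Terms of Matrix of $E_g X_{gij}$ Values


Consider, now, the interpretation of these conditions in terms of
the matrix of $E_g X_{gij}$ values, which is the same as the matrix of $p_{ij}$
values under the item sampling model.
Let $E_g X_{gij} = p_{ij} = A_i + B_j$, where $A_i$ is invariant over $j$ and $B_j$
is invariant over $i$. Then,

$$E_j \sigma^2_i E_g X_{gij} = E_j \sigma^2_i (A_i + B_j) = E_j \sigma^2_i A_i = \sigma^2_i A_i.$$

Also,

$$\sigma^2_i E_j E_g X_{gij} = \sigma^2_i E_j (A_i + B_j) = \sigma^2_i (A_i + E_j B_j) = \sigma^2_i A_i.$$

Similarly,

$$E_i \sigma^2_j E_g X_{gij} = E_i \sigma^2_j (A_i + B_j) = E_i \sigma^2_j B_j = \sigma^2_j B_j$$

and

$$\sigma^2_j E_i E_g X_{gij} = \sigma^2_j E_i (A_i + B_j) = \sigma^2_j (E_i A_i + B_j) = \sigma^2_j B_j.$$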



Therefore,

if $E_g X_{gij} = A_i + B_j$, for all $i, j$, $\alpha = \rho$;

if $E_g X_{gij} = B_j$, for all $j$, $\beta = \rho$;

and

if $E_g X_{gij} = A_i$, for all $i$, $\lambda = \rho$.

These conditions can be expressed as follows. Considering a matrix in
which rows represent persons and columns represent items, $\alpha = \rho$
if the expected observed score of each person on each item is the sum
of a row effect and a column effect ("ability" and "item difficulty")
which do not interact. Also, $\beta = \rho$ if the expected observed score of
each person on each item is determined only by a row effect; that is,
item difficulty is invariant. And $\lambda = \rho$ if the expected observed score
of each person on each item is determined only by a column effect;
that is, ability is invariant. Let

$$A = \frac{N^2 E_j \sigma^2_i E_g X_{gij}}{(N - 1) \sigma^2_{gj} X_{gj}}$$

and let

$$B = \frac{N^2 \sigma^2_i E_j E_g X_{gij}}{(N - 1) \sigma^2_{gj} X_{gj}}.$$

It is known that $\alpha - \beta = B$. The present derivation indicates that
$\rho = \alpha + A - B$ and that $\rho = \beta + A$. These relations make it clear
that $\rho = \alpha \iff A = B$ and that $\rho = \beta \iff A = 0$.
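
These conditions can be checked numerically on a small matrix of $p_{ij}$ values. The Python sketch below computes the parameter values of $\rho$, $\alpha$, $\beta$, and $\lambda$ from such a matrix, with rows as persons and columns as items and every person weighted equally. The function name `reliability_quantities` and the example matrices are assumptions made for illustration, not part of the derivation.

```python
def reliability_quantities(p):
    """Return (rho, alpha, beta, lam) for a persons-by-items matrix of p_ij."""
    k = len(p)                      # number of persons
    n = len(p[0])                   # number of items
    totals = [sum(row) for row in p]             # expected total scores
    mu = sum(totals) / k
    true_var = sum((t - mu) ** 2 for t in totals) / k
    error_var = sum(sum(x * (1.0 - x) for x in row) for row in p) / k
    total_var = true_var + error_var             # observed score variance
    item_means = [sum(p[j][i] for j in range(k)) / k for i in range(n)]
    item_var_sum = sum(m * (1.0 - m) for m in item_means)
    rho = true_var / total_var
    alpha = n / (n - 1) * (1.0 - item_var_sum / total_var)
    beta = n / (n - 1) * (1.0 - mu * (n - mu) / (n * total_var))
    lam = 1.0 - item_var_sum / total_var
    return rho, alpha, beta, lam


# Additive case p_ij = A_i + B_j: alpha should equal rho.
A = [0.1, 0.2, 0.3]
B = [0.1, 0.3, 0.5]
additive = [[a + b for a in A] for b in B]
rho, alpha, beta, lam = reliability_quantities(additive)
```

In the additive example `alpha` agrees with `rho` to rounding error while `beta` and `lam` fall below it; a matrix whose rows are constant (for example `[[0.2] * 3, [0.5] * 3, [0.8] * 3]`) makes `beta` equal `rho` as well.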

Discussion 

The meaning of the above results may be clarified by the following 
points, 

1. The results derived in this paper are consequences of the definition
of test reliability, together with the stipulation that measures
take on possible values 1 and 0 with probabilities $p_{ji}$ and
$1 - p_{ji}$. The equations derived in this paper and the conditions
for the three reliability formulas, then, follow from relatively weak
assumptions and hence are generally applicable. Further identification
of the set of repeated measurements with selection of items from a
population is a suggested possibility.

Lord's original model (1956) identified repeated measurements
with the process of selecting the N items of a test from a common
pool, having a single $p_j$ value for each person j. Extension of the
same concept to distinct $p_{ji}$ values resolves a point raised by
Lyerly (1958) concerning the applicability of the sampling model. As
pointed out by Lyerly, Lord's model assumes that the selection of
each item from a pool is an independent trial in the Bernoullian
sense. In the present model, the selection of N items is a Poisson
sequence of trials. Whether the KR20 formula or the KR21 formula
is equal to ρ is determined from the matrix of $p_{ji}$ values. In the
Bernoullian scheme both α and β are equal to ρ, since in that case
[equation not legible].

2. In the present paper test reliability is defined first as a product-
moment correlation coefficient (a ratio of a covariance to the geometric
mean of two variances) and, then, an equivalent statement as
a ratio of two variances is derived as a theorem. None of the results
in the present paper would be changed by taking equation (1) as a
definition of reliability. The method used in this paper has certain
additional advantages, not related to the present derivation. For
example, it leads to a very direct derivation of attenuation formulas.
3. A method similar to that used in this paper can be used to
derive α for the case of a test consisting of N parts, not necessarily
unit-dichotomous items. Then, necessary and sufficient conditions
for α to equal ρ are found to be the same as those given in this paper
and also the same as those determined by Novick and Lewis (1967). See
also Zimmerman and Burkheimer (1968). That is, the conditions derived
in the present paper [equation not legible] are equivalent to
those derived by Novick and Lewis for coefficient alpha in terms of
equivalent measurements. A similar derivation for the more general
case is also readily made for ^. However, the derivation of β depends
upon the assumption of unit-dichotomous items. The Poisson sampling
scheme is not applicable to the general case, but a sampling
interpretation of the random variable $X_{jik}$ is possible even when it
is not restricted to the values 1 and 0.

4. Equation (8) in this paper is of some interest in its own right.
It gives the reliability of any composite test without restrictions on
the $EX_{jik}$ values. However, it cannot be used to estimate reliability
from sample data (a single administration of a test) as can α, β, and ^.
The equation gives the parameter value of ρ as a function of the
parameter values of variances. But for sample values the right-hand
side of the equation is equal to unity. As the sample estimates of the
variances approach their parameter values, the value of ρ given by
the equation approaches a value less than unity.

5. It is possible to call the quantity $EX_{jik}$ "true score," or T.
This identification does not change the derivation in this paper except
for the substitution of one symbol for another. The advantages of
the present method might be considered. At several points the present
notation makes clear some relationships among expectations and
variances which might be obscured by the T notation. At other
points there is perhaps complexity which could be reduced by a more
compact symbol in place of $EX_{jik}$. It should also be noted that the
quantity $X_{jik} - EX_{jik}$, that is, "error score," has not been needed
at all in the present derivation, although $\sigma^2 X_{jik}$, which is
equal to $E(X_{jik} - EX_{jik})^2$, does enter into the derivation.
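
The sample coefficients referred to in the points above can be computed directly from a persons-by-items matrix of 0/1 scores. The following is a minimal Python sketch, assuming that α and β correspond to the KR20 and KR21 formulas (the discussion suggests this pairing but the excerpt does not state it outright). The function names and the toy data in the usage note are illustrative only, and population (divide-by-n) variances are used throughout.

```python
def _total_stats(scores):
    """Return (n persons, k items, mean total score, variance of total scores)."""
    n, k = len(scores), len(scores[0])
    totals = [sum(row) for row in scores]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    return n, k, mean_total, var_total


def kr20(scores):
    """Kuder-Richardson formula 20: uses each item's own difficulty p_i."""
    n, k, _, var_total = _total_stats(scores)
    item_p = [sum(row[i] for row in scores) / n for i in range(k)]
    sum_pq = sum(p * (1 - p) for p in item_p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)


def kr21(scores):
    """Kuder-Richardson formula 21: treats all items as equally difficult."""
    _, k, mean_total, var_total = _total_stats(scores)
    return (k / (k - 1)) * (1 - mean_total * (k - mean_total) / (k * var_total))
```

With item difficulties of .75, .50, and .25, as in `[[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]`, the two formulas differ (.75 against .60); when every item has the same difficulty they coincide, which is the sample analogue of the invariant item difficulty case described earlier.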
AN INVESTIGATION OF EMPIRICAL SAMPLING
DISTRIBUTIONS OF CORRELATION COEFFICIENTS
CORRECTED FOR ATTENUATION


ROBERT A. FORSYTH AND LEONARD S. FELDT
University of Iowa


The appropriate applications, idiosyncrasies, and potential
misuses of correlations corrected for attenuation have been the
subject of debate for over fifty years. A number of prominent
research workers, including Gulliksen (1950), Lord (1957), Block
(1963), Guilford (1965), and Glass and Taylor (1966), find a variety of
experimental situations in which such coefficients serve useful pur- 
poses. Others, however, appear to regard corrected correlations with 
some mistrust and have tended to discourage their use (A.P.A., 
1966). 

The concern of this latter group stems primarily from two con- 
siderations. First, it is believed that such coefficients can be and 
often are misconstrued by unsophisticated consumers as simple 
correlations between observed scores. The fact that an occasional 
value has been found to exceed 1.0 has created the impression that 
correction formulas may somehow be invalid. Second, very little 
is known about the sampling distribution of a corrected coefficient. 
In the absence of a sampling theory, even the most knowledgeable 
researcher is restricted in his use of these statistics. 

Sampling theory for the corrected r is limited to approximate 
formulas for its standard error. These were derived by Shen (1924), 
Cureton (1936), and Kelley (1947). More recently, Du Bois (1957, 
1965) has shown that the population value of the corrected coef- 
ficient can be viewed as a partial correlation between observed 
scores with errors of measurement held constant. But this
conceptualization does not mean that the sample corrected coefficients are
distributed as partial correlations. In the realm of hypothesis testing
Lord (1957) and McNemar (1958) have derived techniques for
testing the hypothesis that the population value of the corrected
coefficient equals 1.0. Neither of these tests has been generalized to
other hypotheses, however. More important, perhaps, these
techniques do not lead to procedures for establishing confidence intervals.
Because little is known about the form of the sampling distribution,
the existing standard error formulas cannot be used as a basis for
such intervals.

The primary purposes of this investigation were to generate
empirical sampling distributions of coefficients corrected for
unreliability in both variables, to derive techniques for testing
hypotheses about the population value, and to develop procedures for
obtaining confidence intervals. The empirical distributions were
produced on an IBM 7044 computer, using procedures described
in the following section. The testing and confidence procedures
were deduced largely on a pragmatic basis from the empirical data.
As a consequence, these procedures are necessarily approximate and 
do not constitute an analytical solution to the problems of estimation 
and testing. 


Procedures 


A computer program was developed to produce normally distributed
scores $x_1$ and $x_2$ and $y_1$ and $y_2$ with zero mean and unit
variance and with pre-determined linear correlations $\rho_{x_1x_2}$,
$\rho_{y_1y_2}$, and $\rho_{xy}$ ($\rho_{x_1y_1} = \rho_{x_1y_2} =
\rho_{x_2y_1} = \rho_{x_2y_2} = \rho_{xy}$). In effect, the computer
generated scores on two parallel forms of Test X, with known reliability
$\rho_{x_1x_2}$, and scores on two parallel forms of Test Y, with known
reliability $\rho_{y_1y_2}$. The population value of the corrected
coefficient ($\rho_{tt}$) was also known, since

$$\rho_{tt} = \frac{\rho_{xy}}{\sqrt{\rho_{x_1x_2}\,\rho_{y_1y_2}}}.$$

For a given combination of $\rho_{x_1x_2}$, $\rho_{y_1y_2}$, and
$\rho_{tt}$, the scores $x_1$, $x_2$, $y_1$, and $y_2$ were generated
for each of N hypothetical individuals and the sample corrected
coefficient ($r_{tt}$) computed. The process was repeated 1000 times
with the same parameters to obtain an empirical sampling distribution
for that combination of reliabilities and intercorrelation. Three
different formulas for $r_{tt}$ were investigated; since the results
were highly similar for all three, detailed analyses were made only
for the following estimator:

$$r_{tt} = \frac{r_{(x_1+x_2)(y_1+y_2)}}{\sqrt{\dfrac{2r_{x_1x_2}}{1+r_{x_1x_2}}\cdot\dfrac{2r_{y_1y_2}}{1+r_{y_1y_2}}}} \tag{1}$$

In effect, the scores were considered as half-test scores, with
reliabilities estimated by the Spearman-Brown formula. Sampling
distributions were generated for three values of N (50, 100, 200),
four values of $\rho_{tt}$ (1.0, .8, .5, .3) and six combinations of
$\rho_{x_1x_2}$ and $\rho_{y_1y_2}$ (.6, .8; .9, .6; .8, .7; .7, .9;
.8, .8; .8, .9). Thus, empirical distributions were produced for 72
combinations of N, $\rho_{tt}$, $\rho_{x_1x_2}$, and $\rho_{y_1y_2}$.
A detailed frequency distribution and the first four moments of
each distribution were obtained. Because of the sheer bulk of these
data, only representative results will be summarized.
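
The generating procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the original IBM 7044 program is not described in detail, so the true-score-plus-error construction used here to obtain the required correlations is an assumption, as are all function and parameter names. The estimator follows one reading of estimator (1), namely the correlation between full-test scores divided by the square root of the product of the Spearman-Brown reliabilities.

```python
import math
import random


def pearson(a, b):
    """Product-moment correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    return s_ab / math.sqrt(s_aa * s_bb)


def corrected_r(x1, x2, y1, y2):
    """Full-test correlation corrected with Spearman-Brown reliabilities."""
    full_x = [u + v for u, v in zip(x1, x2)]
    full_y = [u + v for u, v in zip(y1, y2)]
    r_x, r_y = pearson(x1, x2), pearson(y1, y2)
    rel_x = 2 * r_x / (1 + r_x)  # stepped-up reliability of Test X
    rel_y = 2 * r_y / (1 + r_y)  # stepped-up reliability of Test Y
    return pearson(full_x, full_y) / math.sqrt(rel_x * rel_y)


def one_sample(n, rho_tt, rho_x, rho_y, rng):
    """Draw n hypothetical individuals and return the sample corrected r."""
    x1, x2, y1, y2 = [], [], [], []
    for _ in range(n):
        # Assumed construction: correlated true scores plus independent errors,
        # giving unit-variance half-test scores with the required correlations.
        t_x = rng.gauss(0, 1)
        t_y = rho_tt * t_x + math.sqrt(1 - rho_tt ** 2) * rng.gauss(0, 1)
        x1.append(math.sqrt(rho_x) * t_x + math.sqrt(1 - rho_x) * rng.gauss(0, 1))
        x2.append(math.sqrt(rho_x) * t_x + math.sqrt(1 - rho_x) * rng.gauss(0, 1))
        y1.append(math.sqrt(rho_y) * t_y + math.sqrt(1 - rho_y) * rng.gauss(0, 1))
        y2.append(math.sqrt(rho_y) * t_y + math.sqrt(1 - rho_y) * rng.gauss(0, 1))
    return corrected_r(x1, x2, y1, y2)


def sampling_distribution(n, rho_tt, rho_x, rho_y, reps=1000, seed=0):
    """Empirical sampling distribution of the corrected coefficient."""
    rng = random.Random(seed)
    return [one_sample(n, rho_tt, rho_x, rho_y, rng) for _ in range(reps)]
```

Calling `sampling_distribution(100, 0.8, 0.8, 0.8)` yields 1000 values whose mean and standard deviation can be compared with the summaries reported in the following sections.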

It should be noted that this method of generating the data for $r_{tt}$
obviates consideration of the problems of experimental design which 
must be resolved in an actual experiment. In effect, each set of 
reliabilities and intercorrelation is assumed to incorporate a con- 
sistent definition of errors of measurement. In an actual investiga- 
tion, of course, the experimenter must determine which correlation 
and reliability estimation procedures represent such a consistent 
definition. These issues have been discussed at some length by 
Cureton (1965). 

The remaining part of this paper is divided into four sections. The
first describes the empirical sampling distributions of $r_{tt}$ and
compares these with the sampling distributions of product-moment
coefficients. This section also discusses the validity of available
standard error formulas for $r_{tt}$. The second section describes a
technique developed to establish confidence intervals for $\rho_{tt}$ and
to test hypotheses about $\rho_{tt}$. The final two sections present
cross-validation results for this technique.


Characteristics of the Sampling Distribution of $r_{tt}$

The means of the sampling distributions for $\rho_{tt} < 1.0$ tended to be
lower than the parameter value. The bias in $r_{tt}$ was very slight,
however. In the case of N = 100, for example, the difference between
$\bar{r}_{tt}$ and $\rho_{tt}$ was in the neighborhood of -.001. When
$\rho_{tt} = 1.0$, this bias disappeared, and there was some suggestion
that it was
reversed. In neither case did the bias appear to be of any significant 
magnitude. 

Although the variance of the $r_{tt}$ values proved to be systematically
greater than that predicted by the formulas of Kelley (1947) and
Cureton (1936), the differences were not very large. The Kelley
formula proved to be slightly more accurate than the Cureton formula,
and was used in all subsequent analyses described below. This
formula is:

[Equation (2), Kelley's formula for the sampling variance
$\sigma^2_{r_{tt}}$, is not legible in this copy.]

For each of the 72 distributions an empirical estimate of
$\sigma^2_{r_{tt}}$ was obtained from the 1000 values of $r_{tt}$. The
Kelley formula, using parameters in place of the statistics of
Equation (2), was also evaluated for the 72 distributions. The mean
absolute deviation of the 72 empirical estimates of $\sigma^2_{r_{tt}}$
from the values given by Kelley's formula was .00031.

Since it is not immediately obvious from Equation (2), it should be
noted that a particular combination of reliabilities has a greater
effect on the $\sigma^2_{r_{tt}}$ value for high values of $\rho_{tt}$
than for low values of $\rho_{tt}$. For example, with N = 100 and
$\rho_{tt} = 1.0$ the ratio of the greatest $\sigma^2_{r_{tt}}$ value
($\rho_{x_1x_2} = .6$ and $\rho_{y_1y_2} = .8$) to the lowest
$\sigma^2_{r_{tt}}$ value ($\rho_{x_1x_2} = .8$ and $\rho_{y_1y_2} = .9$)
was approximately 6.6 to 1. However, when $\rho_{tt} = .30$ and N = 100
the ratio of these two $\sigma^2_{r_{tt}}$ values was about 1 to 1.

Corrected correlations are considerably more variable than
product-moment correlations. For example, with N = 100 and
$\rho_{tt} = .3$, the standard deviation of $r_{tt}$ ranged between .1046
and .1171, depending on the combination of reliabilities and
intercorrelation. Product-moment correlations with $\rho_{xy} = .3$ have
a standard error of .0917, as computed by formulas supplied by Hotelling
(1953). With $\rho_{tt} = .8$, the standard deviations ranged between .048
and .0669. Product-moment correlations with $\rho_{xy} = .8$ have a
standard error of .0368. With $\rho_{tt} = 1.0$ and N = 100, the standard
deviation of $r_{tt}$ ranged from .0169 (with $\rho_{x_1x_2} = .8$ and
$\rho_{y_1y_2} = .9$) to .0452 (with $\rho_{x_1x_2} = .6$ and
$\rho_{y_1y_2} = .8$). When $\rho_{xy} = 1.0$ the sample
product-moment correlation evidences no sampling error, of course.

The skewness of $r_{tt}$ distributions approximates that of
product-moment correlations when $\rho_{tt}$ is relatively low, say .3.
For both statistics the skew is negative. Up to $\rho_{tt} = .8$ the
skewness of $r_{tt}$ tended to increase, but not as rapidly as that of
distributions of $r_{xy}$. For example, with N = 100,
$\rho_{x_1x_2} = .8$, $\rho_{y_1y_2} = .8$ and $\rho_{tt} = .8$, the
skewness of $r_{tt}$, as reckoned by the average cubed deviation from the
mean divided by the cube of the standard deviation, was -.3061.
The same quantity equals -.4902 for product-moment correlations
from a population $\rho_{xy} = .8$. At $\rho_{tt} = 1.0$ this trend of
increasing skewness had reversed, and the distributions of $r_{tt}$
tended to be slightly positively skewed.

Like product-moment correlations, $r_{tt}$ has a sampling distribution
that is leptokurtic for population values above .3. Kurtosis indexes
tended to range from close to 3.0 for low values of $\rho_{tt}$ to 3.4
for the distributions generated under $\rho_{tt} = 1.0$. In general, the
values of the kurtosis indexes were close to those of product-moment
coefficients for $\rho_{tt}$ = .3, .5, and .8.
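
The skewness and kurtosis indexes quoted above are ratios of central moments. A minimal Python sketch of that computation follows; the function name is illustrative, and the kurtosis index is assumed to be the fourth central moment divided by the squared variance (so that a normal distribution gives 3.0, in line with the values quoted).

```python
def moment_ratios(values):
    """Return (skewness, kurtosis) computed from central moments.

    skewness = average cubed deviation / cube of the standard deviation
    kurtosis = average fourth-power deviation / square of the variance
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2
```

Applied to the 1000 values of one empirical distribution, the first element corresponds to the skewness figure reported in the text and the second to the kurtosis index.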


Techniques for Establishing Confidence Intervals for $\rho_{tt}$ and for
Testing Hypotheses about $\rho_{tt}$


In view of the differences between the sampling distribution of
corrected correlations and that of product-moment coefficients, it
seemed obvious that Fisher's z-transformation would not normalize
the distribution of $r_{tt}$. A number of alternative techniques for
testing hypotheses and defining confidence intervals were, therefore,
considered, three of which proved reasonably valid. The test techniques
gave control of Type I error that was close to the nominal levels
employed (.05 and .10) and increasing power with increasing falsity
in the hypothesis. The interval estimation procedures were shown
to produce intervals with empirical probabilities close to the
nominal confidence coefficients adopted (.90 and .95). While the three
techniques appeared equally valid on statistical grounds, one method
(based on normal distribution theory) had the advantage of
simplicity. For this reason, only this method, with the evidence for
its validity, will be presented.

The proposed method of defining confidence intervals and testing
hypotheses is based on the finding that, for suitably large samples,
the sampling distribution of $r_{tt}$ is approximately normal in form.

With an appropriate estimate of the standard error
($\hat{\sigma}_{r_{tt}}$) the 100γ per cent confidence interval for
$\rho_{tt}$ is defined as follows:


Upper Bound = $r_{tt} + z_{(1+\gamma)/2}\,\hat{\sigma}_{r_{tt}}$
Lower Bound = $r_{tt} - z_{(1+\gamma)/2}\,\hat{\sigma}_{r_{tt}}$


where $z_{(1+\gamma)/2}$ = the value of the normal deviate with
$\tfrac{1}{2}(1 + \gamma)$ of the area below it.

It was determined, entirely on empirical grounds, that a satisfactory
estimate of the standard error for this purpose was obtained
by evaluating Kelley's formula for $\hat{\sigma}_{r_{tt}}$, using the
obtained statistics, and multiplying the result by $\sqrt{N/(N - 2)}$.

This procedure may be illustrated with the following data: N =
100, $r_{tt} = .68$, $r_{x_1x_2} = .78$, and $r_{y_1y_2} = .66$. With
these values $\hat{\sigma}_{r_{tt}}$ [formula (2) of the present paper]
equals .0761; $\sqrt{100/98}\,\hat{\sigma}_{r_{tt}} = .0769$. The 90 per
cent confidence interval for $\rho_{tt}$ for this illustrative
example is .68 ± (1.645)(.0769), or .55 to .81.
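
The arithmetic of this illustration can be checked with a short Python sketch. The value from Kelley's formula is taken as an input rather than computed, and the function and argument names are illustrative assumptions.

```python
import math
from statistics import NormalDist


def confidence_interval(r_tt, kelley_se, n, gamma=0.90):
    """Normal-curve interval for the corrected coefficient.

    kelley_se is the standard error from Kelley's formula, supplied by the
    caller; it is inflated by sqrt(N / (N - 2)) as described in the text.
    """
    adjusted_se = kelley_se * math.sqrt(n / (n - 2))
    z = NormalDist().inv_cdf((1 + gamma) / 2)  # normal deviate, e.g. 1.645
    return r_tt - z * adjusted_se, r_tt + z * adjusted_se
```

For the data above, `confidence_interval(0.68, 0.0761, 100, 0.90)` gives bounds that round to .55 and .81.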

When this method is utilized to test hypotheses about $\rho_{tt}$, the
hypothesized value of $\rho_{tt}$ (${}_{H}\rho_{tt}$) is used in the
standard error formula in place of $r_{tt}$. Thus, formula (2) becomes:

[Equation (3), Kelley's formula with ${}_{H}\rho_{tt}$ substituted for
$r_{tt}$, is not legible in this copy.]

The boundaries of the critical region for any hypothesized value of
$\rho_{tt}$ are ${}_{H}\rho_{tt} \pm z_{\alpha/2}({}_{H}\hat{\sigma}_{r_{tt}})$.
The region of rejection, of course, includes values of $r_{tt}$ outside
these boundaries. To illustrate this testing technique, consider the
data for the example above. Assume that it is desired to test the
hypothesis that $\rho_{tt} = .80$. When ${}_{H}\rho_{tt}$ is used in the
standard error formula (i.e., formula 3) its value is approximately
.137; $\sqrt{100/98}\,({}_{H}\hat{\sigma}_{r_{tt}}) \approx .138$.
Therefore, if α = .05, $r_{tt}$ must exceed
.8 + (1.96)(.138) = 1.07 or fall below .8 - (1.96)(.138) = .53
for the hypothesis to be rejected.
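
The critical region for a two-tailed test can be sketched the same way. As before, the standard error from the Kelley-type formula is treated as a given input, and the function and argument names are illustrative assumptions.

```python
import math
from statistics import NormalDist


def critical_bounds(rho_hyp, kelley_se_hyp, n, alpha=0.05):
    """Two-tailed rejection boundaries for a hypothesized corrected correlation.

    kelley_se_hyp is the standard error evaluated at the hypothesized value;
    it is inflated by sqrt(N / (N - 2)) before use.
    """
    adjusted_se = kelley_se_hyp * math.sqrt(n / (n - 2))
    z = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 when alpha = .05
    return rho_hyp - z * adjusted_se, rho_hyp + z * adjusted_se
```

With a hypothesized value of .80, a standard error of .137, and N = 100, the boundaries come out at about .53 and 1.07, so with these data only unusually low sample values could lead to rejection.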


Validity of the Proposed Procedure for Establishing Confidence Intervals


[...] had been generated with N = 50, 1000 samples with N = 100,
[...] and a sample value of $r_{tt}$. The data for each sample were used
to establish two confidence intervals for $\rho_{tt}$, the first with
γ = .90 and the second with γ = .95. For all 1000 samples of a given
size the proportion of intervals which included $\rho_{tt}$ was then
determined. Table 1 presents the results of this validation procedure.


TABLE 1
Proportion ($\hat{\gamma}$) of Empirical Confidence Intervals Enclosing $\rho_{tt}$

                                          N = 50          N = 100         N = 200
$\rho_{tt}$  $\rho_{x_1x_2}$  $\rho_{y_1y_2}$   γ=.90  γ=.95    γ=.90  γ=.95    γ=.90  γ=.95

   .8      .6      .8      .908   .948     .905   .947     .918   .958
   .8      .9      .6      .913   .949     .907   .949     .910   .944
   .8      .8      .7      .889   .935     .902   .941     .921   .956
   .8      .7      .9      .902   .944     .898   .942     .903   .952
   .8      .8      .8      .901   .946     .916   .954     .896   .950
   .8      .8      .9      .889   .927     .889   .936     .887   .946
   .5      .6      .8      .901   .952     .895   .942     .907   .950
   .5      .9      .6      .881   .937     .901   .943     .897   .945
   .5      .8      .7      .890   .928     .892   .949     .903   .944
   .5      .7      .9      .894   .935     .891   .936     .880   .934
   .5      .8      .8      .892   .930     .910   .946     .896   .955
   .5      .8      .9      .904   .937     .887   .944     .906   .951
   .3      .6      .8      .895   .947     [?]    .945     .885   .937
   .3      .9      .6      .888   .931     .901   .942     .892   .937
   .3      .8      .7      .876   .923     .896   .946     .891   [?]
   .3      .7      .9      .880   .932     .898   .948     .894   .944
   .3      .8      .8      .877   .924     .908   .956     .897   .952
   .3      .8      .9      .881   .932     .878   .926     .896   .946

   Mean                    .892   .937     .900   .944     .900   .947
   D = mean |$\hat{\gamma}$ - γ|       .011   .014     .007   .007     .009   .006

Note: [?] marks an entry that is not legible in this copy.


The data in Table 1 indicate that the normal-curve technique works
quite well, particularly with N = 100 and N = 200. The means of
the empirical γ's for each sample size are approximately equal to the
nominal γ. Even with N = 50, the confidence coefficient appears to
be close to that desired. Furthermore, the average absolute deviation
of these estimates from the nominal γ-value is less than .008 when
N = 100 and N = 200. For N = 50, the average absolute deviation
is slightly above .01. Thus, the data in Table 1 supply substantial
evidence to support the validity of the proposed confidence interval
technique.


Validity of the Proposed Procedures for Testing Hypotheses about $\rho_{tt}$


Although there are relatively few practical situations in which
hypotheses other than $\rho_{tt} = 1.0$ might be tested, evidence bearing
on the validity of the proposed hypothesis testing technique was
gathered in this study. For each of the 1000 samples generated under a
given set of parameters, a region of rejection was defined for testing
${}_{H}\rho_{tt}$. The hypothesized value in each case was true. That is,
${}_{H}\rho_{tt}$ was taken as the value actually used in generating the
sample. (Because each sample yielded its own observed reliabilities, the
estimated standard error of $r_{tt}$ and hence the region of rejection
varied from sample to sample.) A test with α = .05 and α = .10 was
conducted for each sample, and the proportion of rejections (Type I
errors) was tabulated for the 1000 tests at each level. A two-tailed
test was made in all cases except those involving the hypothesis of
1.0. In this case, the alternative hypotheses include only the set
$\rho_{tt} < 1.0$, and hence the critical region was placed entirely at the
lower end of the hypothesized sampling distribution. The results of
these validation procedures are given in Table 2.


TABLE 2
Empirical Estimates of α

                                          N = 50          N = 100         N = 200
$\rho_{tt}$  $\rho_{x_1x_2}$  $\rho_{y_1y_2}$   α=.05  α=.10    α=.05  α=.10    α=.05  α=.10

  1.0      .6      .8      .078   .119     .061   [?]      .053   [?]
  1.0      .9      .6      .095   .142     .075   .110     .073   .116
  1.0      .8      .7      .063   [?]      .076   .119     .059   .115
  1.0      .7      .9      .077   .129     .076   .115     .067   .120
  1.0      .8      .8      .085   .121     .069   .116     .064   [?]
  1.0      .8      .9      .076   .114     .053   .114     .064   [?]
   .8      .6      .8      .035   .084     .048   [?]      .036   .080
   .8      .9      .6      .043   .081     .035   .089     .043   .095
   .8      .8      .7      .044   .085     .048   .096     .034   [?]
   .8      .7      .9      .034   .073     .052   .102     .038   .092
   .8      .8      .8      .045   .080     .040   .078     .056   .096
   .8      .8      .9      .050   .094     .046   .098     .050   .108
   .5      .6      .8      [?]    [?]      [?]    .096     .047   .087
   .5      .9      .6      .051   .071     .030   .095     [?]    [?]
   .5      .8      .7      [?]    [?]      .040   .090     .041   .100
   .5      .7      .9      [?]    [?]      .052   [?]      [?]    .108
   .5      .8      .8      .042   .092     .041   .092     .045   [?]
   .5      .8      .9      [?]    [?]      [?]    [?]      .040   .090
   .3      .6      .8      [?]    [?]      [?]    [?]      [?]    [?]
   .3      .9      .6      .048   [?]      [?]    [?]      [?]    .102
   .3      .8      .7      [?]    [?]      [?]    [?]      .047   [?]
   .3      .7      .9      [?]    [?]      [?]    [?]      .049   [?]
   .3      .8      .8      [?]    [?]      [?]    [?]      [?]    [?]
   .3      .8      .9      .046   [?]      .053   .116     [?]    [?]

   Mean                    .056   [?]      [?]    [?]      [?]    [?]
   D = mean |$\hat{\alpha}$ - α|       .013   .016     [?]    [?]      [?]    [?]

Note: [?] marks an entry that is not legible in this copy.


The data of Table 2 indicate that the empirical α-values do not,
in general, differ markedly from the nominal α-level. There is,
however, a definite tendency for the true level of significance to be
higher than the nominal level when $\rho_{tt} = 1.0$ and to be slightly
lower than the nominal level when $\rho_{tt} < 1.0$. The first of these
trends is more pronounced, especially with samples as small as 50.

Aside from the data for $\rho_{tt} = 1.0$, the empirical estimates of
Type I error included in Table 2 are reasonably close to the nominal
levels of significance. When $\rho_{tt} < 1.0$, the test tends toward more
strict control of Type I error than the nominal α suggests. That is,
with a nominal α of .10 or .05, the test tends to be slightly
conservative. The mean absolute deviation from the nominal α-level
ranged from .009 to .016.

It should be noted that the obtained estimates of α, like all sample
proportions, are subject to sampling error. With a true α in the
neighborhood of .05, the standard error of an empirical estimate
based on 1000 repetitions of the test is about .007. With a true α in
the neighborhood of .10, the standard error is about .009. In view of
these standard errors, the third decimal place for the individual
estimates of α is not trustworthy.
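
The two standard errors quoted above follow from the usual binomial formula for a sample proportion, as this brief Python check shows (the function name is illustrative):

```python
import math


def proportion_se(p, reps):
    """Standard error of a proportion estimated from `reps` independent tests."""
    return math.sqrt(p * (1 - p) / reps)
```

With 1000 repetitions, `proportion_se(0.05, 1000)` is about .007 and `proportion_se(0.10, 1000)` is about .009, which is why differences in the third decimal place of individual entries carry little weight.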

An assessment of the usefulness of the normal curve procedure in
testing the hypothesis $\rho_{tt} = 1.0$ can be made by comparing these
results with those obtained via McNemar's procedure for testing the
same hypothesis. Table 3 contains empirical estimates of α obtained
for both of these procedures for the eighteen distributions of $r_{tt}$
when $\rho_{tt} = 1.0$. These data, although not extensive, suggest that
the McNemar test is fairly sensitive to violation of the assumption that
the tests are equally reliable. This seems especially true with
relatively large sample sizes. For example, it can be seen from Table 3
that when N = 200, $\rho_{x_1x_2} = .9$ and $\rho_{y_1y_2} = .6$, the
estimated probability of rejecting a true hypothesis of 1.0 was
approximately .25 when the nominal α was .05. For this same combination
of N, $\rho_{x_1x_2}$, and $\rho_{y_1y_2}$, the estimated α using the
normal curve method was .073. In general, the results shown in Table 3
indicate that the normal curve method provides better control of Type I
errors than the McNemar procedure if N is relatively large (N > 50) and
the two tests differ in reliability.


TABLE 3
Empirical Estimates of α (Nominal α = .05) for Two Tests of H: $\rho_{tt} = 1.0$

  Reliabilities             N = 50            N = 100           N = 200
$\rho_{x_1x_2}$  $\rho_{y_1y_2}$     N.C.  McNemar     N.C.  McNemar     N.C.  McNemar

   .8      .8        .085   .054       .069   .055       .064   .050
   .8      .7        .063   .049       .076   .070       .059   .065
   .8      .9        .076   .061       .053   .050       .064   .071
   .6      .8        .078   .086       .061   .087       .053   .108
   .7      .9        .077   .096       .076   .101       .067   .141
   .9      .6        .095   .155       .075   .18[?]     .073   .255

Note: N.C. = normal curve method; [?] marks a digit that is not
legible in this copy.

In summary, the evidence supporting the normal curve technique
for testing hypotheses is not as strong as that for establishing
confidence intervals by this technique. However, no great discrepancy
between true and nominal α-levels would seem to occur through the
use of this procedure.


Summary 


Seventy-two empirical sampling distributions of correlations corrected
for attenuation ($r_{tt}$) were generated via an electronic computer.
The statistical characteristics of these distributions were summarized
and compared with corresponding characteristics of the distribution
of product-moment correlations ($r_{xy}$). Because the sampling
distribution of $r_{tt}$ differed from those of $r_{xy}$, alternative
procedures to Fisher's z-transformation were investigated to establish
confidence intervals for $\rho_{tt}$ and to test hypotheses about
$\rho_{tt}$. Evidence was presented for the use of techniques based on
normal curve theory for both purposes. It was concluded that these
normal curve procedures provided the experimenter with reasonably good
control of Type I errors. They may be used with even more assurance in
establishing 90 per cent and 95 per cent confidence intervals for
$\rho_{tt}$.


LIKING JUDGMENTS AND MULTIDIMENSIONAL
SCALING


NORMAN CLIFF
University of Southern California


The present study was similar in rationale to those reported by
Cliff and Young (1968), Cliff (1967), Carroll and Chang (1967), 
and Doelert and Hoerl (1967). The former two showed that in a 
number of instances univariate judgments of stimuli had quite 
strong and simple relationships with the spaces derived by means of 
a multidimensional scaling analysis (MDA) of the stimuli. The 
latter two showed that this applied also to judgments of liking or 
preference for members of a given set of stimuli. In particular they 
showed that there were preferred or ideal locations on some of the 
dimensions, and stimuli were liked in proportion to their closeness 
to these desired subjective locations. The correspondence between
the various kinds of judgments and the multidimensional spaces
has been rationalized by Cliff and Young (1968) as supporting the
view that the individual has an internalized map of the stimuli, 
which they call a configuration, and that the configuration is ap- 
plied each time a stimulus is responded to but that the mode of 
application differs in different situations. Multidimensional scaling 
is presumed to reveal an approximation to the configuration. 

Keats (1964) did a related series of three studies that took a
somewhat different approach. His subjects made paired comparison
judgments of some single aspect of stimuli, and also judged the
distance between the stimuli. Keats interpreted the unidimensional
judgments (preference) as fitting the unfolding model (Bennett
and Hays, 1960) and used them to construct multidimensional
arrangements of the stimuli (and subjects). He also used the
distance judgments to construct another set of multidimensional spaces
for the stimuli. Then he compared the spaces derived in the two
ways and attempted to reconcile them.

While in all three studies Keats’ stimuli seemed to fit quite well 
into two dimensions (there were only four or five stimuli in each 
study), the two spaces, one derived from multidimensional unfold- 
ing and one derived from multidimensional scaling, were not very 
similar. However, in one of the studies (where the stimuli were 
political parties) he divided the subjects on the basis of party 
membership and constructed a multidimensional scaling space for 
each. He then attempted to reconcile the preference judgments 
within parties to the multidimensional spaces and concluded that, 
generally speaking, the multidimensional scaling locations for the 
stimuli were consistent with the preferences of party members who 
ranked their own party first, but not with the preferences of those
who did not.

Keats' study was similar to the others in that he attempted to
relate preference judgments to spaces derived by means of
multidimensional scaling. His attempts met with only partial success,
however, and some of that was on the basis of subdivisions of his
subjects which have a post hoc flavor. In the frame of reference of
the present study, it is of interest that his spaces were more
reconcilable when he subdivided his subjects into groups which might
be expected to have different points of view about the stimuli. It is
quite possible that those party members who do not prefer their
own party have a quite different perception of the relationships
among the parties. When the multidimensional scaling is done on
the basis of the average similarity judgments of all the party
members, the resulting structure may not resemble the one they
perceive, since they are only a minority of party members, and their
preferences are not at all consistent with that structure. In short,
their preferences need not be consistent with a structure they do not
perceive.

The present study attempts to circumvent this problem by posing
it in the opposite direction. We will try to find subgroups of
subjects who are fairly homogeneous with respect to their liking or
jects who are fairly homogeneous with respect to their liking or 
preference for the stimuli and try to determine the multidimensional 
structure they have for the stimuli. Carroll and Chang (1967) pre- 
sent a third alternative approach to individual differences. 

The present study is concerned with the degree of liking expressed 
for a given stimulus rather than preference. This is done, firstly,
because of an expectation that the preference for one stimulus over
the other may be especially sensitive to context effects. These can 
apparently be fairly strong (Sutcliffe and Bristow, 1966). Sec- 
ondly, a large number of judgments are required if all the stimuli 
are to be presented in pairs, and, since this is also required for 
the similarity judgments, fatigue and boredom are likely to be 
important if both sets of pairwise judgments are required. 

The study, then, required two kinds of judgments by the subjects:
judgment of the similarity of stimulus pairs, and some numerical
expression of degree of liking for individual stimuli. The liking
judgments were analyzed by the points of view procedure (Cliff,
1968; Young and Pennell, 1967) in order to find patterns of liking
responses which can serve as prototypes for the responses of the
whole group of subjects. Then the multidimensional scaling
ments which parallel these points of view were analyzed to deter- 
mine the multidimensional spaces. Then the liking patterns were 
compared to the positions of the stimuli in the spaces in order 
to determine the degree and kind of relationship. 

The stimuli selected for use in the study were academic fields 
(chemistry, fine arts, sociology, etc.). This set was selected on the
basis of familiarity to the subjects and feasibility of presentation. 


Method 


Instruments 


The data were gathered using instruments similar to those used
by Cliff and Young (1968) and Cliff (1967). The 20 fields listed
in Table 1 were those used in the study. The subjects were
instructed that they were to judge the degree of difference between
academic fields. Then they were presented 150 pairs of the 20
fields. The 150 consisted of the pairings of each of the 20 with 13
of the others plus 20 repeated judgments, two involving each of
the fields. The 20 to be repeated came first in the questionnaire
and were not analyzed. The judgment scale went from one (almost
identical) to nine (extremely different).


TABLE 1
Projections of Courses on Dimensions, "Wanting" and "Liking" Loadings
for "Average" Point of View

                               Dimension
                   1        2        3        4        5      Want    Like

Adventurous      0.274    0.454   -0.169   -0.102    0.227   4.405   4.968
Alert            0.557   -0.561   -0.058    0.224    0.057   4.999   5.30
Argumentative    0.708   -0.245   -0.249    0.264   -0.264   4.407   4.35
Boastful        -0.575   -0.033   -0.298    0.129    0.119   3.973   4.98
Cautious        -0.065    0.284    0.560    0.009   -0.230   4.732   5.00
Conceited       -0.136   -0.100   -0.031    0.300   -0.076   4.905   5.05
Conscientious    0.512    0.897   -0.240   -0.147   -0.350   4.257   4.10
Considerate     -0.480    0.115   -0.319    0.208   -0.203   5.131   5.700
Curious         -0.321    0.335   -0.311   -0.012    0.255   4.077   5.80
Cynical          0.189   -0.090    0.138   -0.529   -0.029   3.644   4.18
Dependable       0.552   -0.127   -0.095   -0.433    0.112   3.577   4.28
Efficient       -0.284    0.012    0.025   -0.381    0.156   4.889   6.1
Excitable        0.462    0.254[?] 0.436    0.302    0.121   5.168   [?]
Impatient       -0.325    0.430   -0.104    0.091   -0.172   4.227   4.704
Modest          -0.465   -0.187    0.164    0.194    0.221   4.407   5.94
Moody            0.608    0.183    0.017    0.213    0.430   4.339   4.39
Nervous         -0.41     0.058    [?]     -0.064   -0.150   4.558   5.450
Poised           0.011[?] -0.475   0.018   -0.222   -0.188   6.340   7.29
Reflective      -0.518   -0.372   -0.037   -0.023    0.320   3.709   5.25
Stubborn        -0.288   -0.329    0.089   -0.022   -0.206   5.325   6.074

Note: [?] marks an entry or sign that is not legible in this copy;
some entries have lost their final digit.

The subjects also made two responses to the stimuli presented
singly. They were asked to report, on a one to nine scale, the degree
to which they would want to major in the subject, and also their
liking for the subject matter.


Subjects and Administration 


The subjects were 34 students in introductory psychology
fulfilling a course requirement that they participate in experiments.
They made the difference judgments in group sessions, and the
liking ratings were made a few days later at the end of a regular
class period.




Analysis 


The analysis may be considered a between-persons factor analysis
adapted for special ends. It begins with an analysis of one set
of responses to determine several prototypical response patterns for
that set. The factor analysis-like procedure allows these patterns
to be chosen in a way that attempts to ensure that there is at
least one prototypical response pattern that is similar to the
actual responses of each person, even though there are fewer response
patterns than persons. These patterns are interpreted as the
responses of "idealized individuals." The procedure represents a
compromise between studying the simple mean response pattern and
studying the data from each individual separately. In the present
data the set so analyzed was the 20 judgments of degree of liking
for the subject matter.

A prototypical response pattern, while resembling the actual re- 
sponses of some subjects more than others, may also be viewed 
(Cliff, 1968) as a weighted average of the response patterns of all 
the subjects. In fact, any response pattern defines a vector of 
weights which, when multiplied by the judgment matrix, yields the 
vector which is the response pattern. This vector of weights may be
applied to a second matrix of responses by the same subjects to
yield a corresponding prototypical response pattern for the second
set of responses. This set parallels the first in the sense that the
same vector of weights is used in computing it as was used in the 
first. Then the patterns derived from the two sets of responses are 
parallel in the sense that the responses of the subjects were weighted 
in the same way in arriving at them. 
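
The idea of parallel patterns can be sketched in a few lines. This is a toy illustration under stated assumptions, not the routine used in the study: the data, the weights, and the names `weighted_pattern`, `liking` and `wanting` are all hypothetical.

```python
def weighted_pattern(weights, responses):
    """Combine subjects' rows: pattern[j] = sum over i of w[i] * responses[i][j]."""
    n_items = len(responses[0])
    return [
        sum(w * row[j] for w, row in zip(weights, responses))
        for j in range(n_items)
    ]


# Hypothetical toy data: 3 subjects, 4 "liking" items and 2 "wanting" items.
liking = [
    [7, 2, 5, 9],
    [6, 3, 4, 8],
    [1, 9, 2, 3],
]
wanting = [
    [6, 1],
    [5, 2],
    [2, 8],
]

# Equal weights reproduce the simple mean response pattern.
mean_weights = [1 / 3, 1 / 3, 1 / 3]
# Unequal weights define another point of view (an arbitrary choice here).
pov_weights = [0.5, 0.5, 0.0]

mean_liking = weighted_pattern(mean_weights, liking)
pov_liking = weighted_pattern(pov_weights, liking)
# The same weights applied to the second response set give the parallel pattern.
pov_wanting = weighted_pattern(pov_weights, wanting)

print(pov_liking, pov_wanting)
```

The point of the sketch is only that one vector of subject weights serves both response sets, which is what makes the two derived patterns comparable.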

In the present study, the 34 by 20 matrix of responses to the 
liking for subject matter section was analyzed first. Seven prototyp- 
ical response patterns or “points of view,” each consisting of “load- 
ings” for the 20 stimuli, seemed necessary to cover the gamut of 
responses made by the 34 subjects. One of these points of view, 
the one most like the arithmetic average, is given as the last 
column of Table 1. The seven sets of weights for the 34 individuals 
were also computed. These were applied to two other sets of data
to produce the corresponding response patterns for those responses.
The two additional sets were the 20 ratings of desire to have the
field as a major and the 150 ratings of degree of similarity of the
subject matter. The analysis to this point follows the rationale
provided by Cliff (1968) and Tucker and Messick (1963), using a
computer routine developed by Young and Pennell (1967). 

The sets of ratings of degree of similarity were treated as distance
estimates and used as input to the Young-Torgerson procedure
(Young and Torgerson, 1967), which is similar in rationale to
Kruskal's (1964). For each point of view, this yields the
projections of the stimuli on underlying dimensions. Then the degree of
relation between liking for a subject and its position in the space
was determined by a series of correlational and regression analyses
within points of view. The latter were designed to find most-liked
directions in the space and/or most-liked locations on the
dimensions. The most-liked direction analyses were made by computing
multiple linear regression functions predicting degree of liking by
means of location on the dimensions. Most-preferred locations in
the space can be found by multiple regression analyses in which
the squares of projections on the dimensions and products of
loadings on different dimensions are used as predictors of liking
(Carroll and Chang, 1967). The regression analyses were carried out
for each point of view. Certain additional correlations between
points of view were computed.


Results


In the case of all seven of the points of view it appeared that
most probably five dimensions were necessary to account for the
distances which have been formed out of the difference judgments.
The first dimension was always appreciably larger than the rest
but the latter were usually all more or less equal in size. The
projections of the points on the dimensions are given in Table 1
for the average point of view. Inspection of the dimensions reveals
that the first dimension had a hard-soft (hard-easy?) character
and the second usually seemed to represent a further differentiation
of the “hard” into physical-mathematical vs. biological-human.
Except for the first and usually the second dimensions, there was
not great comparability across points of view.

The dimension loadings were correlated across the points of view
and the correlations examined to see if there were factors that could
be rotated to a matching position. Inspection of the correlations
indicated that, while there was almost complete consistency with
respect to the first dimension across points of view, the amount
decreased appreciably with the smaller dimensions. A second
dimension was almost always fairly comparable, or could be made so
by rotation, and the same was true of one or more others in
some cases. The accuracy of the match, though, decreased for smaller
dimensions even in those cases where a significant degree of
association could be demonstrated. In general, then, there was some, but
not complete, similarity of the structures across the points of view
with the first dimensions being almost identical in all cases. Insofar
as the structures for different points of view are not identical, they
may represent differences in the way the subjects see the fields,
but these differences are rather subtle. However, since five
dimensions are only moderately overdetermined by 130 distances among
20 points, it may be that the solutions differ largely as a result
of instability or experimental error.
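
One rough way to see why 130 distances only moderately overdetermine a five-dimensional solution is to count free coordinates. The counting rule below (subtracting arbitrary translation and rotation) is a common convention and an assumption here, not something stated in the text; a nonmetric procedure uses only the rank order of the distances, so the ratio is indicative at best.

```python
def free_coordinates(n_points, n_dims):
    """Coordinates left after removing arbitrary translation and rotation."""
    total = n_points * n_dims
    translation = n_dims
    rotation = n_dims * (n_dims - 1) // 2
    return total - translation - rotation


n_points, n_dims, n_distances = 20, 5, 130
free = free_coordinates(n_points, n_dims)
ratio = n_distances / free  # distances available per free coordinate
print(free, round(ratio, 2))
```

With these figures there are about one and a half distances per free coordinate, which fits the description of the solution as only moderately overdetermined.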

Table 2 shows that there is great diversity among points of view
with respect to the two liking ratings. A number of the correlations
are negative, and most are low, indicating that generally speaking
there is little agreement between the points of view concerning which
are the well liked fields and which are not. Thus, we really do have
points of view. The third and fifth points of view (C and E) are
an exception, being in nearly complete agreement. It may be of
interest to note that the point of view which is most like an
ordinary average is G; it shows low or moderate positive
correlations with all the others.


TABLE 2
Correlations between Ratings by Points of View

Want as Major Ratings

       A       B       C       D       E       F
B     .25
C    -.43    -.20
D    -.33     .00     .69**
E    -.48*   -.12     .99**   .70**
F    -.09    -.85     .57**  -.19     .53*
G    4 bee — .45* i30 — 46 22

Like Subject Ratings

       A       B       C       D       E       F
B    -.09
C    -.19    -.02
D    -.20     .08    2s L
E    -.25     .06     .99**   .72**
F     .09    -.23    IBI 0 038 9 e888
G    28 E qe -57 WEM 048

Correlations between “Like” and “Want”

       A       B       C       D       E       F       G
      .90**   .92**   .96**   .96**   .96**   .96**   .8a**

* Correlation significant at p < .05.
** Correlation significant at p < .01.

The generally low correlations here are in marked contrast to the 
correlations between the first dimension loadings among all points 
of view, which were .94 or more. Also, there are very high correla- 
tions between the two liking ratings for a given point of view, as 
can be seen in the lowest part of Table 2. These findings indicate 
that there was considerable difference across subjects in feelings 
about the fields, that there was close agreement between the two 
kinds of ratings, and that there were some differences in multidi- 
mensional structures between the different points of view. 

Our main interest centers around whether the valence reflected
by the ratings is related in any simple fashion to the
multidimensional spaces. The principal evidence is in Table 3 where the
correlations between the liking loadings and the dimensions are given
along with the multiple correlations with all five dimensions.


TABLE 3
Correlations of Liking with Dimensions

                     Dimension
POV          1     2     3     4     5     Multiple R
jor 26170.10.209 7.1898" ;187 —.407 820 
Ay Pee SOT EAL untae 997 401 T0 
Major 415 01 
B j :333 —.221 093 —.101  .6 
, Subject 1206 236 —.220 —.031 040  .88 
Major .815** o 
c ij 14 —.171 —.018 —.092  .84 
Subject sie — 7x17 —-193 ^ QS TO gso 
Major .569** 4 
D j 051 —.206 —. =. -63 
Subject 572% “gus Liao Bar Tiis 1689 


Major y " 
E pr :820** — .094 —.019 —.003 —.013  .825 
BUDISE ye S19 nes iis, cen 189" 


Maj A * 
F jor -632* .222 — .002 .T04 
sors di EC C T Too E EC 


Ge N A vagy c 599.808 


Lo 41 mec LN TNNT ECT 


** Correlation significant at p < .01.


TABLE 4
Significant Partial Correlations with Squares and
Products of Projections

POV   Rating    Significant partials
A     Major     .530* with 2 X 4
      Subject   none (.435 with 2 X 4)
B     Major     .508* with 1 X 2
      Subject   .520* with 1 X 2
C     Major     -.696** with 1²; -.582** with 4²
      Subject   -.477* with 1²; -.503* with 4²
D     Major     none
      Subject   none
E     Major     -.700** with 1²
      Subject   -.492* with 1²
F     Major     none
      Subject   none
G     Major     none
      Subject   none

* Partial correlation significant at p < .05.
** Partial correlation significant at p < .01.


This allows us to evaluate the degree to which a point of view
reflects a preferred direction in the space.

Table 3 shows that there are significant correlations between
liking and the first dimension in most cases and that sometimes
the correlations are rather high. Where they are highest, the
multiple correlation is also significant. The correlations, though, appear
not to be high enough to warrant the conclusion that there is a
perfectly regular relation between the liking and direction in the
space.

Next, the analysis evaluating the presence of preferred locations
in the space was performed. As described earlier, this involves
correlating the degree of liking for a stimulus with the squares and
products of projections on the dimensions as well as the projections
themselves. Here, a substantial correlation with a dimension itself 
is presumed to reflect not a preferred direction in the space but 
rather a preferred location which happens to be beyond the set of 
points representing the stimuli. 

This analysis has the undesirable feature that it results in a
large number of predictor variables; for k dimensions there are the
k sets of projections themselves, the k sets of squared projections,
and the k(k - 1)/2 sets of products of projections. In the present
data this implies a total of 20 variables to be correlated with the
liking data. Since there are only 20 points, this precludes the
application of a multiple regression approach to the whole set. The
approach here was to search for significant correlations of liking
with the squares and products when correlations with the
dimensions themselves were partialled out. Significant partial
correlations of this type were found for four of the seven points of
view, including B which had no significant correlations in Table 3.
Table 4 lists the significant partial correlations.
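
The predictor count can be reproduced directly. The sketch below simply enumerates the three kinds of predictors for k = 5; the labels such as `d1*d2` and the function name are hypothetical conveniences, and the residual degrees of freedom assume an intercept is also fitted.

```python
from itertools import combinations


def predictor_names(k):
    """Labels for the linear, squared and cross-product predictors of k dimensions."""
    dims = list(range(1, k + 1))
    linear = [f"d{i}" for i in dims]
    squares = [f"d{i}^2" for i in dims]
    products = [f"d{i}*d{j}" for i, j in combinations(dims, 2)]
    return linear, squares, products


linear, squares, products = predictor_names(5)
n_predictors = len(linear) + len(squares) + len(products)

n_points = 20
# Residual degrees of freedom if all predictors plus an intercept were fitted.
residual_df = n_points - n_predictors - 1
print(n_predictors, residual_df)
```

With as many predictors as stimuli no residual degrees of freedom remain, which is why the full regression cannot be fitted and partial correlations are examined instead.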

Discussion

The investigation of human likes and dislikes has traditionally
followed either a unidimensional or a subjects by stimuli approach.
In the former, stimuli are presented singly or in pairs to subjects who
report in any of a number of ways their degree of liking
(admiration, preference, etc.) for the stimuli; then subjects are treated as
replications and the stimuli are “scaled” by some procedure
derived from psychophysics. The latter may be based on a
sophisticated or elaborate model such as Luce’s (Luce, 1959) or
Thurstone’s (Torgerson, 1958), and the degree of success of the scaling
then constitutes a test of the model. The subjects by stimuli
approach is basically factor-analytic. It attempts to use the fact that
different people like different things to isolate those characteristics
of stimuli which differentiate people. Here, the most formal
statement is Coombs’ (Coombs, 1964) but most empirical work follows
Tucker's vector model (Tucker, 1960), at least implicitly. These
are not designed to lead to understanding of preference; they yield
information on the characteristics of stimuli which lead to
intersubject disagreement.


... multidimensional spaces suggested that an individual's preferences
should relate in some fairly simple way to how he perceived the stimuli.
More elaborate models should follow rather than precede the
determination of how preferences relate to individual perceptions. Also,
individual differences should not be the means of finding the
primary stimulus dimensions. Rather, the individual differences in
preference should relate either to individual differences in primary
stimulus dimensions or to different ways of using similar ones.
We attempt to find both preferences and perceptions independently,
and relate the two to each other.

Our attempt met with some success in that there are significant 
correlations in Table 3 and some of them are fairly high. However, 
in view of the considerable degree of agreement on the dimensions 
across the points of view and the high correlations between the 
two kinds of ratings for each point of view (which may be taken 
as a lower bound repeated measures reliability) the relations do 
not appear to be strong enough to justify the conclusion that, for 
the individual, preference corresponds to a single direction in the 
stimulus space. 

Two studies (Carroll and Chang, 1967; Doelert and Hoerl, 
1967) have recently been reported which found that liking or pref- 
erence for a certain stimulus depended upon its distance, along 
particular dimensions, from an ideal point. This implies that liking 
will tend to be correlated negatively with the squared projections of
the points on the relevant dimensions provided that (a) the ideal
points lie toward the middle of the distribution of stimulus points,
and (b) the axes are aligned with the relevant dimensions. To the 
degree that (a) is not true and the ideal point does not lie at 
the center of the stimuli the correlation will be with the projections 
themselves rather than with the squares. If axes are located so that 
they lie between relevant and irrelevant dimensions, the correlations 
will tend to be with products of projections rather than with squares. 
The analysis summarized in Table 4 was performed with a view 
toward evaluating the possibility that liking related to distance 
from an ideal point in the present instance as well. The analysis 
revealed a sufficient number of relations of the indicated form to
suggest that the individual does use this mode of responding, at
least approximately. 
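
The reasoning about ideal points can be illustrated with a one-dimensional toy case. The rule used below (liking falls with squared distance from the ideal point), the five stimulus positions and the two ideal-point locations are assumptions chosen for illustration; they are not data from the study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def liking_from_ideal(projections, ideal):
    """Hypothetical ideal-point rule: liking falls with squared distance."""
    return [-(p - ideal) ** 2 for p in projections]


# Five stimuli spaced symmetrically about zero on one relevant dimension.
proj = [-2.0, -1.0, 0.0, 1.0, 2.0]
squares = [p ** 2 for p in proj]

central = liking_from_ideal(proj, ideal=0.0)   # ideal point amid the stimuli
remote = liking_from_ideal(proj, ideal=10.0)   # ideal point beyond the stimuli

r_central_linear = pearson(central, proj)
r_central_square = pearson(central, squares)
r_remote_linear = pearson(remote, proj)
r_remote_square = pearson(remote, squares)
print(r_central_linear, r_central_square, r_remote_linear, r_remote_square)
```

In this toy case a central ideal point yields a strong negative correlation with the squared projections and none with the projections themselves, while a remote ideal point shifts the relation almost entirely onto the projections, as condition (a) in the text implies.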

The fact that the relations found here are not as strong as those 
reported by others either for similar judgments, as in the case of
Carroll and Chang (1967) and Doelert and Hoerl (1967), or for
somewhat different ones (Cliff and Young, 1968), may be explained
in a variety of ways. First, it may be that the large number of 
dimensions relative to the number of distances may have led to 
some being poorly defined. Second, it may be that our implicit as- 
sumption that the function relating liking to the dimensions had 
only a single maximum may be in error. It may be that liking 
is functionally related to the dimensions but the function has sev- 
eral maxima, perhaps widely separated. Evaluation of this possi- 
bility would require use of a much denser set of stimuli, and this 
is not feasible with current methodology. Finally, it may be that 
our basic assumption is wrong and the difference judgments and 
liking judgments are just based on different aspects of the stimuli 
and no strong relations are to be expected. The latter seems unlikely 
in view of the success of earlier studies and the quite significant 
degrees of relationship found here; all seven points of view showed 
at least one significant correlation with either the dimensions 
themselves or their squares or products.


Stephenson (1953) proposed that types could be identified by
intercorrelating actual persons or group representatives, on the basis
of their responses to ipsatively presented items, and by factor
analyzing the resultant matrix. Gordon (1969) tested Stephenson's
typological model by treating the Survey of Interpersonal Values
(Gordon, 1960) as a structured Q-sort (Gordon and Hofmann,
1968) and by factor analyzing intercorrelations based on mean
scores of 59 different American samples. The outcome of this
analysis was highly positive. Four factors accounted for more than 98
per cent of the variance. Each of the factors was clearly defined
by a set of groups which had in common characteristics making
plausible their having substantial loadings on the same
factor. Mean trait scores of the defining groups were meaningful ...

... or group patterns to established reference groups or types was
referred to as “Q-typing.”

... were intercorrelated on the basis of their means on the Survey of
Interpersonal Values (SIV), in the original and in translation, and
the matrix of student samples was factor analyzed. The resultant
factors clearly differentiated among cultures. When marker
variables from the original American analysis were combined with the
student samples and the new matrix was refactored, it was found
that the oriental samples could be described both in terms of
the original American types, reflecting a generality of types across
cultures, and correlationally in terms of specific American reference
groups.

Additional research has been undertaken to assess the utility of 
information obtained from type development and from the correlational
approach on which it is based. Results so far available
suggest that Q-typing methodology will be of value for studying 
the images of political figures, the development of job families on 
the basis of similarities in motivational or other personality varia- 
bles, and related applications. 

The present study was designed to further test Stephenson’s model, 
using instrumentation other than the SIV. The Edwards Personal 
Preference Schedule (Edwards, 1959), or EPPS, was selected for 
this purpose since it is ipsative, and it measures a large number
of variables, potentially being able to yield a much greater number 
of factors than the four that had been obtained with the SIV. 


Samples 


Samples used in the present study were obtained from a screening
of EPPS references in Buros (1964), and in the Psychological
Abstracts for 1964 and 1965. Forty-five groups were included in
the present analysis, of which 30 (1 to 30) were male and 15 (31
to 45) were female. Following is a brief description of each.¹

a. Groups 1 through 3 consisted of male prisoners and
delinquents: (1) prisoners at the Marion (Ohio) Correctional
Institution; (2) federal prisoners at Terminal Island; and (3) juvenile
delinquents at the Boy's Industrial School, Lancaster, Ohio.

b. Groups 4 through 10 consisted of male hospitalized mental
patients and aides: (4) psychiatric patients (Iowa); (5) neurotic,
(6) character disordered and (7) schizophrenic patients in a
Veterans Administration hospital; (8) schizophrenics, tested shortly
after admission, who later were found to be conditionable; and
(9) who were found to be nonconditionable; and (10) psychiatric
aides (Connecticut).

c. Groups 11 through 13 consisted of male military personnel:
(11) Air Force volunteer enlisted men in their first week in
basic training; (12) Navy enlisted men who volunteered for
submarine training; and (13) Navy officers in the submarine service.

d. Groups 14 through 20 were employed males: (14) managerial
personnel and (15) salesmen, both tested for industrial clients at
Western Reserve University; (16) foremen at the New York
Telephone Company; (17) foremen in a soap manufacturing company;
(18) mechanical engineers from various companies; (19) certified
public accountants in various agencies; and (20) the normative
male adult sample.

e. Groups 21 through 30 consisted of male educational
administrators, students, and teachers: (21) educational administrators
tested in an in-service training program; (22) junior pharmacy
students (Oklahoma); (23) first year medical school students; (24)
high school teachers and (25) science high school teachers enrolled
in an institute in science teaching; (26) freshmen and sophomore
education students; (27) a national sample of high school students;
(28) Negro college students at Philander Smith College, Arkansas;
(29) Seattle high school students and (30) Caucasian Hawaiian
high school students.

f. Groups 31 through 35 consisted of female samples parallel to
male groups identified above: (31) prisoners at the Ohio
Reformatory for Women; (32) juvenile delinquents at the Girls Industrial
School, Delaware, Ohio; (33) psychiatric patients (Iowa); (34)
psychiatric aides (Connecticut); and (35) the normative female
adult sample.

g. Groups 36 through 45 consisted of female students and
teachers: (36) student nurses in training (South Carolina); (37)
elementary and (38) secondary teachers (Illinois); (39) physical
education teachers; (40) graduate students and (41) seniors,
selected as successful from a national sampling; (42) a national
sample of college students; (43) Negro students at Philander Smith
College, Arkansas; (44) Seattle high school students; and (45)
Caucasian Hawaiian high school students (the last four also paralleling
previously identified male groups).

¹ The source of data for each group will be found in the references.


Procedure 


The procedure followed was identical to that used in the
analysis of the group data for the SIV (Gordon, 1969). Groups were
intercorrelated on the basis of their mean scores on the 15 EPPS
need scales, and the resultant intercorrelation matrix was factor
analyzed by the principal-components method. The first five factors,
which accounted for 95.3 per cent of the variance, were subjected to
varimax rotation. The orthogonal factor loadings are presented in
Table 1.²
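
The first step of this procedure, correlating groups rather than scales, can be sketched as follows. The three group profiles are hypothetical numbers on five scales, not values from the study, and the later steps (principal-components extraction and varimax rotation) are deliberately left out of the sketch.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical mean scores on five need scales for three groups;
# these numbers are illustrative and are not taken from the study.
profiles = {
    "group_a": [14.0, 15.5, 12.0, 16.0, 10.5],
    "group_b": [14.5, 15.0, 12.5, 16.5, 10.0],
    "group_c": [17.0, 11.0, 15.0, 12.0, 14.0],
}

names = sorted(profiles)
# Correlations between groups across scales, with unities in the diagonal.
r_matrix = {
    a: {b: pearson(profiles[a], profiles[b]) for b in names} for a in names
}

for a in names:
    print(a, [round(r_matrix[a][b], 2) for b in names])
```

Groups with similar mean profiles correlate highly and would be expected to load on the same factor once a matrix of this kind is factored.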


TABLE 1
Rotated Orthogonal Factor Loadings
Group                                 N      I     II    III    IV     V
1. Prisoners (Ohio) (M) 190  .46  .68  .05  .38 = 
2. Federal Prisoners LIE i56 5.14 500 


3. Juvenile Delinquents (M) 115 —.15 .05 .20 .93 = 


4. Psychiatric Patients (M) 159  .79 lA  .35  .09 Wd 
5. Neurotics 20 56 37 .62  .20 -Ï 
6. Patients-Character Disorders 20 83 145 —.14 .03 & 
7. Schizophrenies A 20 RTI .35 43 = 
8. Schizophrenics (Cond) B 30 .90 —.02 16 -1 
9. Schizophrenies (Non-Cond)B 30 54 — .36  .33 —.26 =f 
10. Psychiatric Aides (M) BB) ib. 0.58 0.08 —.00) a 
11. Air Force Enlisted Men 144 15 24 .33 AT 
12 MY Enlisted Men i Fy 
(Submarine) 12020550225 1 35. 301080 
13. Navy Officers (Submarine) 90 lo 90 16 25 3 
14. Mansgers 93 37 .89 03 —. 14 
15. Salesmen 50 —02 i94 26 00 4 
16. Foremen (Utility) 84 10 .90 10 .26 E 
1T. Foremen (Manufacturing) 56  .67  .63 —.03 05 
18. Engineers-Mechanical 50 -.03 91 —.27  .1B 7g 
19. Certified Publio Accounts BO ye 12 Iou. — 14 .20 TN 
20. General Adult (M) 100 7776 "^53 —.01 —.12 3 
21. Education Administrators 32 33 72 50 —.18 ^w 
22. Pharmacy Students 50.09 .30 21 85 8 
2i MedislStudents (Ist year) 60 17 165 39.2 73 
- High School Teachers (M) BONUR lez el 06 — 
25. Bei School Teachers à d 
cience (M) —.05 
26. Education Students (M) ae H a M 12 » 
27. College Students (M) 760 —.41 53 4 AL A 
28. College Students—Negro (M) 63 65 44 91 SON 


² Unities were used in the diagonals, as in the previous analyses.


TABLE 1 (Continued)
Rotated Orthogonal Factor Loadings

Group                                      N      I     II    III    IV     V
29. High School Students (Seattle) (M)    799   -.24    aT    .30    .83    .07
30. High School Students (Hawaii) (M)      89   -.37   .32    .25    7%    -.13
31. Prisoners Ohio (F)                    134    17   -.10    .25    .51   -.28
32. Juvenile Delinquents (F)              104   .42   -.20    .36    .4    -.10
33. Psychiatric Patients (F)              103   .85   -.21    AL     .00    .05
34. Psychiatric Aides (F)                  40   .92    .06    .2    -.23    .03
35. General Adult (F)                     100   .87   -.14    .35   -.15    .14
36. Student Nurses (F)                    222   .39    .23    .50    .07    .65
37. Student Teachers-A (F)                73:5. S 309... BD LOTO TUTTO
38. Student Teachers-B (F)                 89   .24    .27    .86    .29   -.08
39. Physical Education Teachers (F)       100   .54    .36    .72    .06    .09
40. Physical Education Grad.
    Students (F)                           55   .31    .18    .86    .23   -.02
41. Physical Education Seniors (F)        100   .27    .23    .88    .38    .04
42. College Students (F)                  749   .18    .04    .91    .36   -.02
43. College Students-Negro (F)            108   .79    .20    .40    .13   -.08
44. High School Students-Seattle (F)      834   .20   -.29    .75    .49    .19
45. High School Students-Hawaii (F)        57   .28   -.30    .71    .52    .05


Each factor was interpreted by noting first the nature of the 
groups that had the higher loadings, and then the needs which 
uniquely differentiated those groups from groups defining other fac- 
tors. For the latter purpose, for each factor the simple average of 
the means of the four groups with the highest loadings was com- 
puted for each need. The average means are presented in Table 
2 for each need together with the median and range for the 45 
samples. Average values that fall within the upper and lower 15 
per cent of the entire distribution are arbitrarily treated as being
high and low, respectively. 
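
The selection and averaging step can be sketched as below. The Factor I loadings are those legible for a few groups in Table 1; the scale means are hypothetical placeholders, the function names are invented for the sketch, and ties in rank (which led to five groups being averaged for Factor II) are not handled.

```python
def defining_groups(loadings, n_top=4):
    """Group numbers with the n_top highest loadings on one factor."""
    ranked = sorted(loadings, key=loadings.get, reverse=True)
    return ranked[:n_top]


def average_profile(groups, means_by_group):
    """Simple (unweighted) average of the groups' means, scale by scale."""
    n_scales = len(means_by_group[groups[0]])
    return [
        sum(means_by_group[g][s] for g in groups) / len(groups)
        for s in range(n_scales)
    ]


# Factor I loadings for a few groups, as legible in Table 1.
factor_1 = {4: 0.79, 8: 0.90, 33: 0.85, 34: 0.92, 35: 0.87, 43: 0.79}

# Hypothetical means on two scales for those groups (not the study's data).
means = {
    4: [13.0, 12.0], 8: [15.0, 11.0], 33: [16.0, 10.0],
    34: [15.5, 11.5], 35: [14.5, 10.5], 43: [14.0, 12.5],
}

top = defining_groups(factor_1)
profile = average_profile(top, means)
print(top, profile)
```

On these loadings the four defining groups are 34, 8, 35 and 33, which agrees with the groups named for Factor I in the Results.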


Results 
Factor I is called Docility. Both male and female samples are
represented by high loadings on this factor. However, of the four
samples with the higher loadings three were female: psychiatric
patients, psychiatric aides and adults in general. Male
conditionable schizophrenics comprise the fourth. These defining groups are
distinguished from those characterizing the remaining factors in
being high in Deference, Order, Succorance, Abasement and
Nurturance and low in Dominance and Heterosexuality. It may be
noted (Table 1) that male psychiatric patients and psychiatric
aides, Air Force enlisted men, foremen (utility), male adults,
female prisoners and Negro students of both sexes also have their
highest loading on this factor. This factor has been labeled Docility
in that deference, submissiveness, lack of feeling of personal worth,
need for system, and need to help others and to be helped by them
are the predominant characteristics of the defining groups, and are
reasonably applicable to other groups with high loadings.


TABLE 2
Median and Range of EPPS Means for All Groups and Overall Means
for the Set of Groups Defining Each Factor*

                                          Factor
Scales            Median   Range     I      II     III    IV
Achievement        14.7     8.2    18.5   18.0   12.0   13.7
Deference          13.4     5.8    15.3   12.3   13.1   i
Order              12.8     6.7    15.8   13.1   11.4   10.7
Exhibition         13.2     5.7    12.0   14.1   14.3   15.9
Autonomy           12.6     5.5    12.1   12.0   12.7   13.9
Affiliation        15.0     6.2    16.2   718.7  17.3   See
Intraception       15.9     6.8    25/73114.2. 17.7, ele
Succorance         10.8    03197795 8. 1.94308
Dominance          14.2    10.2    10.9   19.3   14.4   13
Abasement          14.7     7.3    Wl 3123       14.9   14.9
Nurturance         15.6     6.6    18.0   12.6   15.9   14.2
Change             15.5     6.3    35.4   14.7   17.5   16.6
Endurance          15.4     7.0    ISATS         13.8   14
Heterosexuality    13.4    13.1     8.8   14.7   13.5   18.0
Aggression          1.7     5.5    304    12.55   9.9   14.3

* Groups on which the overall means were computed: I. 8, 33, 34, 35; ...

³ Due to a tied rank, five groups were averaged for Factor II.

Factor II, Dominant Striving, is a masculine factor. Salesmen,
certified public accountants, mechanical engineers, foremen and
Navy officers have the more extreme loadings. Typical group
members are particularly high in Dominance, Achievement and
Endurance, and low in Nurturance, Succorance and Abasement. No
female sample has a significant loading in this factor. Other male
samples that have their highest loading in this factor include
male prisoners, managers, education administrators, medical
students, high school teachers, education students and college students
in general. The extreme Dominance and Achievement means for
defining groups served as the basis for the naming of this factor.

Factor III, labeled Friendly Interest, has predominantly female
representation. Female college students, physical education students
and student teachers have very high loadings. Female high school
students and physical education teachers and male neurotics have
their highest loadings on this factor. The defining groups are char- 
acterized by high means on Affiliation, Intraception and Change and 
low means on Aggression, Endurance, and Achievement. Their over- 
all profile reflects a need to be with people and understand them 
and a liking for variety, but little achievement motivation or ag- 
gression. 

Factor IV, called Adolescent Revolt, is primarily masculine in
character, with male juvenile delinquents, pharmacy students and
high school students having the higher loadings. High means on
Heterosexuality, Aggression, and Exhibition and low means on
Deference and Order typify these samples. Adolescent Revolt appears
to be a suitable descriptive label, in view of both the mean profile
and the substantially younger age of the defining groups. This inter- 
pretation is supported by the fact that young Navy enlisted men 
and female juvenile delinquents also have their highest loadings on 
this factor, and that the two female high school student samples 
have significant loadings as well. 

Factor V is uninterpretable. Student nurses and medical school
students have moderately high positive and negative loadings,
respectively, but no reasonable conclusions may be drawn regarding
its nature.

Another method that was used to assess the meaningfulness of
the factors (Gordon, 1969) consisted of determining the
applicability of factor descriptions to groups that had moderate or high
loadings on more than one factor. Following are interpretations
for groups of this type.

a. Each of the following male groups, psychiatric aides,
manufacturing foremen, Ohio prisoners, patients with character disorders,
Negro college students and adults in general, is characterized by
positive loadings on Docility and Dominant Striving. Members of
the first two groups have definite duality of interpersonal
orientation, control over those in their charge and obedience to authority;
many of the members of the male adult normative sample were
undoubtedly in similar positions occupationally; the male Negro
college student may very well have felt the need to exhibit Docility
within the southern community and Dominant Striving within his
own.

b. Male hospitalized neurotics, and female psychiatric patients,
student teachers, physical education teachers and Negro college
students have positive loadings on Docility and Friendly Interest.
Docility on the part of the institutionalized (clinically and
educationally) and the Negro female is quite understandable. Friendly
Interest is found to typify educators and females.

c. Air Force basic trainees, female prisoners and female juvenile 
delinquents are represented on Docility and Adolescent Revolt.
High loadings on Docility are certainly reasonable for these groups
who are living under strict discipline. The need for aggression and
sexual activity that they report is not unexpected.

d. Male education administrators, high school teachers and edu-
cation students are represented on Dominant Striving and Friendly
Interest. The former is consonant with their being in or aspiring to
positions of control, the latter with an interest in and desire to under-
stand those in their charge. Male college students have similar
representation on these factors, with a positive and negative load-
ing on Adolescent Revolt and Docility, respectively, as well.

e. Federal prisoners and Navy enlisted men who volunteered for
submarine duty have significant loadings on Dominant Striving and
Adolescent Revolt, the former also having a significant loading on
Docility. Certain of these relationships appear to be logical.

f. Both the Seattle and Hawaiian groups of female high school
students have significant loadings on Friendly Interest and Adoles-
cent Revolt, which are plausible characteristics of these young
women.

No statistical formula exists for the determination of the re-
liability of mean trait patterns on which the correlations are based.
However, intercorrelations between pairs of groups which represent
somewhat similar populations provide some empirical evidence
of the stability of these patterns. Five such pairs of groups can
be identified in the present study: male federal and state prisoners
(.95); Hawaiian and Seattle female high-school students (.96);
Hawaiian and Seattle male high-school students (.93); female
elementary and secondary student teachers (.89); and female phys-
ical education seniors and graduate students (.96). The median
correlation of .93 for this set is much higher than the median
correlation of .49 between these samples and all other samples and
is comparable in magnitude to that found with the SIV in a similar
analysis (Gordon, 1969). The level of intrapair correlations would
be expected to be even higher if random samples from identical
populations had been available.


Discussion 


Q-typing methodology appears to yield meaningful results when 
the EPPS is used as the research instrument. A reasonably clear 
factor structure was obtained on the basis of analytic rotation,
and the resultant factors were interpretable both in terms of known
characteristics of the defining groups and the means which charac- 
terized these groups. 

A comparison between the factors yielded in the original (Gor- 
don, 1969) and present studies is of interest. (1) “Control of 
Others,” in the SIV analysis, was defined by military officers, 
salesmen and business executives and was typified by high Leader- 
ship and low Support means. Dominant Striving closely resembles 
this factor with similar groups and the counterpart needs of Domin-
ance and Succorance being represented. (2) “Institutional Re-
straint” was the second SIV factor, being defined by prisoners and
nurses and with military enlisted personnel also having high load-
ings. High Conformity and low Leadership means characterized
the defining groups. Docility is similar to Institutional Restraint,
with analogous groups entering into its definition and high Order
and Deference and low Dominance means typifying these samples.
(3) “Service to Others,” the third factor, was described by teachers,
Peace Corps volunteers, first year medical school students, all of
whom had high Benevolence, moderate Leadership and below aver-
age Conformity means. Friendly Interest is somewhat similar to
“Service to Others” in that the service oriented occupations, teach-
ers and student nurses, contribute to its definition. Also, EPPS means
on counterpart dimensions (Nurturance, Dominance and Defer-
ence) deviate from the general mean in directions predictable from
SIV results. (4) The fourth SIV factor, “Self Determination,”
typified by high Independence means and groups such as gifted 
high school students, college students, psychiatrists, does not have 
a parallel in the present analysis. Similarly the fourth EPPS fac-
tor, Adolescent Revolt, is unique to the present study. Adolescent
Revolt is characterized particularly by high Heterosexuality means 
and is identified by members of the younger age groups. 

Groups with significant loadings on two or more factors, for 
the most part, could be meaningfully described in terms of the gen- 
eral definitions of the respective factors. However, in a number of 
instances, relevant means for these groups did not deviate from 
the general mean in the same direction as did corresponding means 
of the defining groups (Table 2). Four or more scales typically 
entered into each factor definition, and significant correlations be- 
tween defining groups and factorially complex groups were usually 
based on a high degree of correspondence between means of only 
some of these scales. On the other hand, in the SIV analysis, only
one high and one low mean characterized a factor and its de-
fining groups, so that groups that had significant loadings on two
factors typically had the same two high and two low means that
characterized both sets of defining groups.

In conclusion, results of the present test of Stephenson's typolog-
ical model support the first set of findings that meaningful inter-
pretable types can be generated through factor analysis by inter-
correlating groups on the basis of their responses to ipsatively
presented items. It was theoretically possible to obtain as many as
14 interpretable factors from the EPPS matrix. However, despite
the diversity of samples included, only four emerged, and these ac-
counted for almost all of the variance in the matrix. Since four
factors also were yielded in the original study with the SIV, the
present findings strongly suggest that relatively few patterns may
account for similarities and differences in basic motivational orien-
tations and that simple instrumentation may be adequate for their
identification and measurement.


SEX DIFFERENCES IN SELF-DESCRIPTION
ON THE ADJECTIVE CHECK LIST*


GEORGE V. C. PARKER 
The University of Texas at Austin 


For several of the more widely used standard personality inven-
tories, there have been developed useful scales specifically for
measuring sex differences in response to the inventory items. In ad-
dition to the Masculinity scale (M) of the Guilford-Zimmerman
Temperament Survey (GZTS) (Guilford and Zimmerman, 1949),
designed primarily to measure sex differences with regard to inter-
ests and emotional expressiveness, the most significant scales of this
kind are the Masculinity-femininity (Mf) scale of the Minnesota
Multiphasic Personality Inventory (MMPI) (Dahlstrom and
Welsh, 1960) and the Femininity (Fe) scale of the California
Psychological Inventory (CPI) (Gough, 1957). Another personal-
ity assessment technique coming into wider use for research and 
clinical purposes is the Gough Adjective Check List (ACL). How- 
ever, although there are a number of existing scales for this inven- 
tory (Gough and Heilbrun, 1965), there has been no reported 
investigation thus far identifying systematically sex differences as- 
sociated with ACL item endorsement. 

To help broaden understanding of masculinity and femininity in 
self-concept and self-description behavior, the present research was 
directed toward two objectives: (a) to develop empirically an ACL 
femininity scale through identification of ACL adjectives for which
reliable sex differences in frequency of endorsement exist, and (b)
for validational purposes, to relate responses to the new scale to 
performance on the existing measures of masculinity and fem- 
ininity. Beyond theoretical significance, the empirically-derived 
ACL femininity (Fem) scale, based on highly stable samples, 
could conceivably be of practical significance when used with other 
personality measures in such applied settings as university counsel- 
ing centers. The ACL Fem scale may also be of value to research- 
ers in a number of fields concerned with the role of masculine or 
feminine response tendencies in self-conceptualization. 


Method 


Virtually the entire 1965 freshman class of the University of 
Texas at Austin completed the ACL as part of the regular entrance 
testing program which is administered annually by the University 
Testing and Counseling Center. A total of 5017 students were 
tested: 2212 females (mean age = 18.7) and 2805 males (mean 
age = 18.6). The raw data for the ACL were recorded by the stu- 
dents on standard IBM true-false answer sheets, which were then 
converted into punched card form by an IBM 1230 optical scoring 
machine. A computer program was written for a CDC 1604 com-
puter and used to transfer the punchcard data to magnetic tape.
Subsequent processing was carried out on a CDC 6600 computer 
using standard programs described elsewhere (Veldman, 1967). 


Procedures 


Development of the ACL Fem scale was accomplished through 
several steps. In the first major step, differences between male and 
female Ss in relative proportions of item endorsement were com- 
pared for each of the 300 ACL adjectives by z tests (McNemar, 
1962, p. 60). In part because of the large N’s involved, this analysis 
showed that using the traditionally accepted levels of statistical 
significance resulted in an extremely large proportion of adjectives 
associated with statistically reliable differences in sex-endorsement 
rates. Consequently, alpha = .00005 was established as the criterion 
for defining statistically significant sex differences. Essentially, this
criterion was used in order to identify a sufficient number of items 
for scale reliability, as well as to maintain a relatively high contri- 
bution of sex-discrimination power among selected adjectives. The
criterion resulted in identification of a total of 133 sex-related ad-
jectives (94 feminine and 39 masculine items).
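
The item-by-item comparison described above can be illustrated with a pooled two-proportion z test. The sketch below is only an assumption about the form of the test, since the formula from McNemar (1962, p. 60) is not reproduced in the text; the function names are hypothetical. The example proportions for "sentimental" (54 per cent of males, 80 per cent of females) come from Table 1, and the group sizes come from the Method section.

```python
import math


def two_proportion_z(p1, n1, p2, n2):
    """Pooled z statistic for the difference between two independent proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    return (p1 - p2) / standard_error


def two_sided_p(z):
    """Two-sided normal tail probability for a z statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# "sentimental": endorsed by 80% of 2212 females and 54% of 2805 males (Table 1).
z_sentimental = two_proportion_z(0.80, 2212, 0.54, 2805)
meets_criterion = two_sided_p(z_sentimental) < 0.00005  # alpha used in the study
print(round(z_sentimental, 2), meets_criterion)
```

With samples of this size even a difference of a few percentage points produces a large z, which is consistent with the author's decision to adopt a much stricter alpha than the conventional levels.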
The second major step involved relating performance on the
ACL Fem scale to performance on the CPI Fe, MMPI Mf, and
GZTS M scales. Ss for this phase were 147 male and 160 female
undergraduates enrolled in an introductory psychology class. Ss
were given, under normal instructions, the ACL and a composite of
randomly ordered items comprising the Fe, Mf, and M scales. All
score data were converted to standard scores for correlational
analysis.

Results and Discussion


Table 1 lists the ACL adjectives for which sex differences in pro-
portions of endorsement met the selection criterion. These items
comprise the ACL Fem scale. Because there are 94 feminine items
and 39 masculine items, ACL Fem scale scoring was organized in the
indicative-minus-contraindicative manner typical of many other
ACL scales. Thus, an individual's ACL Fem scale Raw Score = Σ
Feminine items minus Σ Masculine items. The average raw score on
the ACL Fem scale for an independent sample of 147 males = 20.07;
for 160 females = 34.88; t = 12.21, p < .01. This compares closely
with data for the original sample: male X̄ = 20.66, N = 2805;
female X̄ = 35.26, N = 2212; t = 64.58, p < .0001. A test-retest
procedure, using an independent sample of 50 male and 50 female
undergraduates, who were given the ACL twice under standard con-
ditions, with an interval of three months, yielded acceptably high
reliability coefficients for the Fem scale items: males = .86, females


TABLE 1 
ACL Femininity Scale Adjectives 


Indicative Items (Feminine, N = 94)


            % Male     % Female                 % Male     % Female
            Endorse-   Endorse-                 Endorse-   Endorse-
Adjective   ment       ment       Adjective     ment       ment
61 74 charming 19 27 
80 86 cheerful 65 80 
22 31 complicated 42 51 
33 63 confused 24 39 
20 27 conscientious 69 75 
65 78 considerate 80 86
contented 27 43 praising 24 33 
conventional 21 28 prudish 4 15 
cooperative 83 90 rattlebrained 3 9 
cowardly 6 13 responsible 66 76 
dependent 22 32 selfish 14 22 
disorderly 14 20 sensitive 53 72 
dreamy 26 43 sentimental 54 80 
effeminate 2 6 simple 15 22 
emotional 43 65 sincere 68 81 
enthusiastic 63 7 snobbish 5 9 
excitable 42 59 soft-hearted 49 68 
fearful 12 25 spontaneous 20 33 
feminine 1 74 spunky 15 23
fickle 10 31 stubborn 39 50 
flirtatious 23 45 submissive 10 16 
foolish 11 16 superstitious 7 14 
forgiving 66 81 sympathetic 55 77 
friendly 82 88 talkative 32 49 
frivolous 7 12 temperamental 25 37 
generous 52 60 tense 20 26 
gentle 57 67 thoughtful 56 70 
hard-headed 46 53 timid 15 20 
headstrong 26 33 tolerant 50 59 
helpful 74 82 touchy 18 27 
high-strung 18 25 trusting 48 64 
hurried 22 33 unaffected 12 24 
idealistic 42 53 unassuming 7 11 
immature 17 24 understanding 68 80 
impulsive 32 45 unrealistic 4 9 
informal 52 60 warm 48 66 
inhibited 16 23 whiny 2 6 
kind 71 79 wholesome 38 57 
loyal 68 79 worrying 40 50 
mannerly 63 75 zany 9 14 
meek 11 17 
mischievous 34 43 
modest 42 54
moody 42 54 
nagging 5 13 
natural 45 65 
nervous 32 m 
noisy 9 14
optimistic 56 65 
outgoing 34 47 
patient 36 43 
planful 31 39 
pleasant 57 69 
poised 26 42 
Contraindicative Items (Masculine, N = 39) 
aggressive 41 23 clear-thinking 68 
argumentative 56 47 [nd gs 37 
arrogant 12 7 coarse 7 3 
autocratic 12 6 confident 58 40
cool 32 22 shrewd 23 11 
deliberate 39 29 sly 11 6 
dissatisfied 31 26 stern 14 8 
egotistical 22 14 stolid 6 3 
enterprising 43 32 strong 38 25 
forceful 19 11 tough 13 5 
foresighted 43 37 unemotional 10 6 
handsome 23 2 unexcitable 6 3 
indifferent 16 11 vindictive 8 4 
ingenious 21 15 wise 33 27 
inventive 32 25 
masculine 67 1 
opportunistic 32 22 
pleasure-seeking 62 54 
progressive 50 43 
resourceful 48 41 
rigid 7 3 
robust 14 9 
self-confident 49 36 
sharp-witted 41 29 
show-off 19 13 


Note. —All sex differences significant at p < .00005, based on comparison of N = 2805 males and 
N = 2212 females. 


Inspection of the adjectives contained in Table 1 reveals what 
would appear to be several gross inconsistencies in self-description 
within female Ss. Specifically, over half (54%) of these Ss de- 
scribed themselves as “moody,” while 69 per cent saw themselves as 
“pleasant.” In effect, then, at least 23 per cent of the females were 
able to see themselves as being simultaneously moody and pleas- 
ant. A similar inconsistency occurred in regard to “sentimental,” 
endorsed by 80 per cent of the female Ss, and “hard-headed” which, 
while not an adjective included in the ACL Fem scale, was self- 
attributed by 53 per cent of the same Ss. A simultaneous endorse- 
ment of these two concepts, therefore, occurred with at least 33 per 
cent of the female Ss. A final example is “impulsive” (45%) and
“responsible” (76%), indicating a simultaneous endorsement by a
minimum of 21 per cent of the female Ss. The unavoidable conclu-
sion must be that sizeable proportions of the females were able, 
without apparent stress, to juxtapose several pairs of logically dis- 
similar (and in some ways mutually exclusive) characteristics 
within the framework of a single self-concept. It is notable that no
similar inconsistencies were found in the data from male Ss. 
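
The "at least" figures in the paragraph above follow from a simple bound: if two adjectives are endorsed by a and b per cent of the same group, then at least a + b - 100 per cent (and never fewer than zero) must have endorsed both. A minimal sketch, with a hypothetical function name, that reproduces the three figures quoted in the text:

```python
def min_joint_endorsement(pct_a, pct_b):
    """Smallest possible percentage endorsing both adjectives, given each marginal percentage."""
    return max(0, pct_a + pct_b - 100)


# Female endorsement percentages quoted in the text.
print(min_joint_endorsement(54, 69))  # moody and pleasant -> 23
print(min_joint_endorsement(80, 53))  # sentimental and hard-headed -> 33
print(min_joint_endorsement(45, 76))  # impulsive and responsible -> 21
```

The bound is a minimum only; the actual proportion endorsing both adjectives could be determined only from the individual answer sheets.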
Because in the ACL format scale scores are correlated positively
with total number of adjectives endorsed, a correction factor must
be introduced. The procedure used by Gough and Heilbrun (cf.
Gough and Heilbrun, 1965) was followed identically in this study.
Four-category standard score conversion tables were calculated 
for the ACL Fem scale, based upon the original sample of 2,212 
females and 2,805 males. Tables 2 and 3 present the standard score 
conversion data. Fundamentally, a high Fem scale score indicates a 


TABLE 2 
Conversion of Raw to Standard Scores for ACL Femininity Scale 


Males 


Total Number of Adjectives Endorsed 


1-75 76-95 96-121 122-300 


Raw Standard Scores 

-23 1 

-22 2 

-21 8 ul 

-20 5 2
-19 6 3
-18 8 5 1
-17 9 6 2 1
-16 10 7 3 2
-15 12 8 4 3
-14 13 9 5 4
-13 14 11 6 5
-12 16 12 8 6
-11 17 13 9 8
-10 19 14 10 9
-9 20 16 11 10
-8 21 17 13 11
-7 23 18 14 12
-6 24 19 15 13
-5 26 20 16 14
-4 27 22 17 15
-3 28 23 19 16
-2 30 24 20 17
-1 31 25 21 18
0 32 27 22 19
1 34 28 23 21
2 35 29 25 22
3 37 30 26 23
4 38 32 27 24
5 39 33 28 25
6 41 34 29 26
7 42 35 31 27
8 44 36 32 28

98 88
99 89
100 90


TABLE 3
Conversion of Raw to Standard Scores for ACL Femininity Scale


Females


Total Number of Adjectives Endorsed


1-78 79-98 99-119 120-300


Raw Standard Scores

—16 1 

-15 2 

-14 3 

218 4 

-12 5 

-n 1 

~10 8 

250 9 

568 10 

B n 

-6 13 

zu 14 

-3 16 3 1 

=g 18 4 2 4 

Eo 19 5 3 1 
0 20 7 4 2 
1 21 8 5 3 
2 28 9 7 4 
3 24 11 8 5 
= 25 12 9 7 
5 26 14 10 8 
6 28 15 12 9 
T 29 16 13 10 
8 30 18 14 n 
2 31 19 15 12 
10 33 20 17 13 
M 34 22 18 m 
35 23
36 24 
38 26 
39 27 
40 28 
41 30 
43 31 
44 33 
45 34 
46 35 
48 37 
49 38 
50 39 
51 4l 
53 42 
54 43 
55 45 
56 46 
57 47 
59 49 
60 50 
61 52 
62 53 
64 54 
65 56 
66 57 
67 58 
69 60 
70 61 
71 62 
72 64 
74 65 
75 66 
76 68 
77 69 
79 71 
80 72 
81 73 
82 75 
84 76 
85 77 
86 79 
87 80 
89 81 
90 83 
91 84 
92 85 
94 87 
95 88 
96 90 
97 91 
99 92 
100 94 
95


66 96 86 76 
67 98 87 77 
68 99 88 78 
69 100 89 79 
70 91 80 
71 92 81 
72 93 82 
73 94 83 
74 96 84 
75 97 86 
76 98 87 

99 88 
78 100 89 
79 90 
80 91 
81 92 
82 93 
83 94 
84 96 
85 97 
86 98 
87 99 
88 


tendency to endorse ACL items in the feminine direction while a
low Fem scale score indicates a tendency to respond to the ACL
items in the masculine direction. The highest possible raw score on
the Fem scale would be 94 (endorsement of all feminine adjectives
and none of the masculine items). Correspondingly, the lowest raw
score possible would be -39 (endorsement of all masculine adjec-
tives and none of the feminine items).
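
The raw-score rule described above (feminine items endorsed minus masculine items endorsed) can be sketched as follows. The function name is hypothetical, and the two key sets are deliberately tiny illustrative subsets of the Table 1 adjectives rather than the full 94-item and 39-item keys. The sketch does not apply the correction for total number of adjectives endorsed that Tables 2 and 3 provide.

```python
# Illustrative subsets only; the full keys have 94 feminine and 39 masculine items.
FEMININE_KEY = {"sentimental", "warm", "gentle"}
MASCULINE_KEY = {"aggressive", "forceful", "tough"}


def fem_raw_score(endorsed, feminine=FEMININE_KEY, masculine=MASCULINE_KEY):
    """Number of feminine-keyed adjectives endorsed minus number of masculine-keyed ones."""
    endorsed = set(endorsed)
    return len(endorsed & feminine) - len(endorsed & masculine)


# Adjectives outside both keys (here "capable") do not affect the raw score.
print(fem_raw_score(["sentimental", "warm", "tough", "capable"]))  # 2 - 1 = 1

# The score is bounded by the key sizes: +3 and -3 here, +94 and -39 for the full scale.
print(fem_raw_score(FEMININE_KEY), fem_raw_score(MASCULINE_KEY))
```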

Validational information for the ACL Fem scale, based on its re- 
lation to the MMPI Mf, CPI Fe, and GZTS M scales, is summarized 
in Table 4. All correlation coefficients are in the expected direction 
and all are statistically significant. Moreover, the magnitudes of the 
correlation coefficients between the ACL Fem scale and the other 
masculinity-femininity scales are similar overall to the intercorre-
lation coefficients among the other masculinity-femininity scales
themselves. From a construct validation point of view, the pat-
tern of intercorrelations among these four scales suggests that the 
ACL Fem scale may offer as much to the understanding of mascu- 
linity and femininity in self-description as the existing scales have 
thus far. Additionally, the rather modest magnitude of all of these

TABLE 4
Intercorrelations of ACL Fem Scale and Other Measures of
Masculinity and Femininity*

              ACL-Fem   MMPI-Mf   GZTS-M
MMPI-Mf   M:   30*
          F:  -34**
GZTS-M    M:  -32**     -49**
          F:   21       -19*
CPI-Fe    M:   41**      41**     -49**
          F:   21**     -94**      37**


Note. Male N = 147; female N = 160.
* Decimals omitted.

coefficients suggests that research with psychometrically-defined
response variables might benefit from utilization of combina-
tions of these scales for future studies.

It is relevant at this point to note that it was anticipated that
sex discrimination and scale validity might be improved with the
use of even more stringent criteria for item selection. Two addi-
tional criteria were used in developing a preliminary version of the
Fem scale: first, regardless of the statistical significance of the sex
difference, an adjective must have been endorsed by at least 20
per cent, or no more than 80 per cent, of one of the sexes. Second,
there must have been a difference of at least 10 per cent between
the proportions in item endorsement. These criteria reduced the
number of scale items to 60 (47 feminine and 13 masculine). How-
ever, scores on this preliminary version of the Fem scale bore
weaker predictive relationships overall with the other masculin-
ity and femininity scales than did the final Fem scale version
shown above in Table 1.
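
The two additional criteria can be sketched as a simple filter over endorsement percentages. The wording of the first criterion is ambiguous, so the sketch assumes it means that at least one sex endorsed the adjective at a rate between 20 and 80 per cent inclusive; that reading, the function name, and the parameter names are assumptions rather than the author's stated procedure. The example percentages are taken from Table 1.

```python
def passes_preliminary_criteria(pct_male, pct_female, low=20, high=80, min_diff=10):
    """Apply the two extra item-selection criteria under the assumed reading.

    Criterion 1 (assumed reading): at least one sex endorses the adjective
    at a rate between `low` and `high` per cent inclusive.
    Criterion 2: the two endorsement rates differ by at least `min_diff` points.
    """
    in_range = any(low <= pct <= high for pct in (pct_male, pct_female))
    large_difference = abs(pct_male - pct_female) >= min_diff
    return in_range and large_difference


print(passes_preliminary_criteria(54, 80))  # sentimental: True
print(passes_preliminary_criteria(2, 6))    # whiny: False (both rates below 20)
print(passes_preliminary_criteria(82, 88))  # friendly: False (both rates above 80)
```

Whether this reading reproduces the reported 60 retained items cannot be checked from the text alone.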

Summarized in Table 5 are the statistically significant correla-
tions between the Fem scale and other ACL scales. While more cor-
relations were significant for female Ss (15) than for male Ss (10),
the general pattern of the coefficients is similar and the magnitudes
of the correlations indicate that the Fem scale is sufficiently independ-
ent from other ACL scales to serve a useful function itself. The Fem
scale shows significant negative relations with those ACL scales
which appear to typify more masculine kinds of behavioral tenden-
cies, i.e., Dominance, Self Confidence, Autonomy, Aggression. Sig-

TABLE 5
Significant Correlations between ACL Femininity Scale
and Other ACL Scales(a)
ACL Scale Female Ss(b) Male Ss(c)
Defensiveness 17*
Favorable 21**
Self Confidence -23** -30**
Lability 16*
Personal Adjustment 26**
Achievement -21*
Dominance -20** -35**
Nurturance 42** 41**
Affiliation 37**
Heterosexuality 37**
Autonomy -33** -32**
Aggression -23** -26**
Succorance 36** 39**
Abasement 39** 44**
Deference 40** 43**
Counseling Readiness -27** 26**
(a) Decimals omitted.
(b) N = 160.
(c) N = 147.
* p < .05
** p < .01


nificant positive relations also occur with regard to ACL scales 
defining response tendencies commonly regarded as being more char-
acteristically feminine, i.e., Nurturance, Abasement, and Deference.
Given the descriptions of scale-associated characteristics and re-
lated inferential data provided by Gough and Heilbrun (1965), the
pattern of correlations in Table 5 is as would be anticipated. In 
only one case is there a difference in the direction of the correla- 
tions between male and female Ss: the relation between Fem and 
Counseling Readiness. This sex difference probably has a rela- 
tively straightforward explanation from both a theoretical and an 
empirical standpoint. The direction of the correlations indicates 


in a primarily masculine way. Correspondingly, the male whose 
self-concept includes a 


to counseling responsiveness, the correlations are consistent
with research findings related to differences in counseling termina-
tion behavior of male and female clients, reported by Heilbrun
(1961a, b).

It is interesting to consider the nature of two groups of ACL items
which almost all Ss, regardless of sex, tended either to endorse or
not to endorse as self-descriptive. The 16 adjectives which were
most commonly endorsed by all Ss are shown in Table 6, from which
it can be seen that they uniformly embody highly socially-desirable


TABLE 6
ACL Adjectives Most Frequently Endorsed by Both Sexes


                   Proportion          Proportion
Adjective          Male Endorsement    Female Endorsement


adaptable* 91
ambitious*
appreciative**
capable
civilized**
considerate**
cooperative**
curious*
dependable*
interests-wide*
sincere**
understanding** 68


* Sex difference sig. p < .05.
** Sex difference sig. p < .00005.
Italics indicate adjective not part of ACL Favorable item scale.


characteristics. In fact, except for the single item “civilized,” the
overlap with the ACL Favorable item scale is complete. In con-
trast, Table 7 presents the 45 ACL items which were rarely en-
dorsed by either sex. It can be seen that these adjectives are gen- 
erally uncomplimentary and, for the most part, describe socially-
undesirable personal characteristics. Comparison of Table 7 with
the ACL Unfavorable item scale reveals that two-thirds of the infre-
quently endorsed items appear in the Unfavorable item scale. 

It seems likely that the adjectives in Table 6 could serve a useful
function in self-description with the ACL parallel to the F scale
function with the MMPI, or the Commonality scale of the CPI. It

TABLE 7
ACL Adjectives Least Frequently Endorsed by Both Sexes


Proportion Proportion Proportion Proportion
Male Female Male Female
Endorse- Endorse- Endorse- Endorse-
Adjective ment ment Adjective ment ment

blustery* 3 2 slipshod 2 2
coarse** 7 3 sly** 11 6
cowardly** 6 13 smug 5 4
cruel 6 4 snobbish** 5 9
deceitful 7 6 spineless 3 3
despondent 7 7 stolid** 6 3
effeminate** 2 6 thankless 4 3
frivolous** 7 12 tough** 13 5
hard-hearted 7 5 unambitious 4 3
hostile 5 4 unassuming** 7 11
infantile 1 2 undependable 3 3
intolerant 9 8 unexcitable** 6 3
nagging** 5 13 unfriendly 2 2
obnoxious* 5 3 unintelligent 2 2
prudish* 4 14 unkind 2 1
queer 1 1 unscrupulous 3 2
quitting* 4 7 unstable* 4 6
rattlebrained** 3 9 vindictive** 7 4
reckless* 10 7 weak* 4 7
retiring* 6 8 whiny** 2 6
rigid** 7 3
rude* 5 3
severe* 7 5
shallow 4 4
shiftless* 4 2


* Sex differences sig. p < .05.
** Sex differences sig. p < .00005.
Italics indicate adjective not part of ACL Unfavorable item scale.


is also possible that the infrequently endorsed items in Table 7 
could be used as critical items in the detection of serious adjustment 
problems. In view of the extreme infrequency with which they are 
endorsed among college students, this group of adjectives could
conceivably be particularly valuable in assisting college counselors
with their work.


GENERALITY OF RISK TAKING ON
OBJECTIVE EXAMINATIONS


MALCOLM J. SLAKTER 
State University of New York at Buffalo 


Risk taking on objective examinations (RTOOE) is defined as 
guessing when the examinee is aware that there is a penalty for in- 
correct responses (Slakter, 1967b). There is evidence that RTOOE 
is related to behaviors such as dominance-submission (Votaw, 
1936), maladjustment (Sherriffs and Boomer, 1954), vocational
choice (Ziller, 1957b), and perception of risk in military situations 
(Torrance and Ziller, 1957). Since RTOOE measures can be ob- 
tained from Ss ostensibly taking achievement or aptitude tests, it 
would appear that RTOOE provides psychologists with a poten- 
tially useful disguised measure of risk taking. 

In addition, there are indications that Ss low in RTOOE tend to 
be penalized on their test score (Sherriffs and Boomer, 1954; Slak- 
ter, 1968; Votaw, 1936). These studies have demonstrated that 
when Ss who are low in RTOOE are forced to respond to all test 
items, their average test score will increase even though the usual 
penalty for guessing is applied. Since these findings provide evidence 
that RTOOE confounds the achievement or aptitude being meas- 
ured by the examination, RTOOE becomes of interest to individuals 
concerned with educational measurement. 

Results of various studies (e.g., Stone, 1962; Swineford, 1941;
Slakter, 1967b) indicate that the RTOOE behaviors of individuals
on a given test are impressively consistent. Generally speaking, no
matter which index of RTOOE is used, the measure of RTOOE pro-
vides reliabilities at least as high as those provided by the aptitude 
or achievement dimension that the test was designed to assess. As a 
next step, therefore, one wonders whether RTOOE behavior is con-
sistent across various kinds of test situations; e.g., different types of
achievement and aptitude tests, different types of items, varying
difficulties, etc. In other words, are the RTOOE behaviors of indi-
viduals general across various kinds of testing situations, or are 
these RTOOE tendencies highly specific to the particular testing 
situation? 

For this study, the question of concern was the generality of 
RTOOE across different types of tests. Specifically, interest was
centered on the values of the RTOOE correlations among the fol-
lowing four types of tests: (1) language aptitude, (2) mathematics
aptitude, (3) language achievement, and (4) mathematics achieve-
ment.

In the only previous study of the generality of RTOOE that the 
writer is aware of, Swineford (1941) administered four different 
tests to 457 high school freshmen. The four tests were described by 
Swineford as follows (1941, p. 439):

"One is a non-language test, Paper Form Board, in which each of 
the twenty-eight items consists of one of four geometrical figures cut 
into three or four sections. The subject is to determine which figure 
the sections would fit if they were reassembled. The second test,
General Information, is a multiple-choice test of one hundred items 
covering factual information. The third is a fifty-item multiple- 
choice test of vocabulary. The fourth is a thirty-six item true-false 
test of logical deduction based on series of inequalities written in 
terms of letters of the alphabet.” 

Swineford’s index of RTOOE was introduced in an earlier study 
(Swineford, 1938), and will be symbolized by Rs. Unfortunately,
however, because of the nature of the Rs index or the directions,
many of the Ss had to be eliminated from the study. Swineford 
(1941, p. 439) stated: “Of the four hundred fifty-seven pupils who 
were tested, seventy-four boys and thirty-nine girls were elimi- 
nated from the gambling study either because on one or more of 
these tests no extra credits were requested, or because on one or more 
tests no errors were made among the items attempted." 

Swineford (1941) found that males displayed higher RTOOE
than females, and Rs was essentially uncorrelated with test score.
With respect to RTOOE correlations among the four tests, Swine-
ford (1941) found that the correlations for male, female, and total
group ranged from about .2 to .8, with little, if any, sex difference.
Swineford concluded (1941, p. 442): “Examination of the correla-
tions reveals that a G factor common to all the tests and an over-
lapping factor between the verbal tests may be postulated.” Due to
the awkward directions associated with Rs, Swineford's index was
not used in this study.


Procedure


Ss were students in the 8th grade, selected from four schools in
western New York State and two schools from a city in Ontario,
Canada. In one of the Canadian schools, classes were conducted only
in English; in the other Canadian school, both French and English
were spoken. In three of the New York State schools, and the bi-
lingual Canadian school, a random sample of 100 Ss was selected
from the entire 8th grade class. In the remaining New York State
school, it was decided to test the entire 8th grade class of 145 stu-
dents. In the English-speaking Canadian school, since the 8th grade
was split administratively into two large sections, one entire section
of 128 students was selected at random. As expected, in each case
there was attrition due to absences, etc., so that the sample size in
any particular school was smaller than that listed above. The exact
sample sizes are found in the results section.
Ss were seated in one large room, and the first of two test batteries
administered to them was the Standard Educational Intelligence
Test (SEIT). The general directions for SEIT began with the sen-
tence: THIS BOOKLET CONTAINS A TEST WHICH WILL
GIVE YOU A CHANCE TO SHOW WHAT YOU KNOW AND
HOW WELL YOU THINK. SEIT was composed of two parts, the
first supposedly a measure of language aptitude, and the second sup-
posedly a measure of mathematical aptitude. The language section
contained 10 nonsense items embedded in 40 legitimate items; the
mathematics section was composed of 10 nonsense items embedded
in 30 legitimate items. In constructing the legitimate items for both
parts of SEIT, an attempt was made to select items that would be
less familiar to the Ss than those legitimate items on the correspond-
ing achievement tests. One measure of the success of this attempt
can be determined by a comparison of the difficulties of the aptitude
and achievement tests in either the language or mathematics dimen-
sion. The comparisons of these difficulties are examined in the re-
sults and discussion section.

The second test battery administered to the Ss was titled Standard 
Educational Achievement Test (SEAT), and the general directions 
began with the sentence: THIS BOOKLET CONTAINS A TEST 
OF SOME OF THE KNOWLEDGE YOU HAVE GAINED DUR- 
ING YOUR SCHOOL YEARS. Like SEIT, SEAT was composed of 
two parts, the first supposedly a measure of language achievement, 
and the second supposedly a measure of mathematical achievement. 
The language section in the SEAT booklet contained 10 nonsense 
questions embedded in 30 legitimate questions; the mathematics 
section of SEAT also consisted of 10 nonsense items embedded in 30
legitimate items. In both the language and mathematics sections of
SEAT, an attempt was made to have all of the legitimate items 
cover material that the Ss had previously been exposed to in their 
formal schooling. 

Therefore, the administration of SEIT and SEAT provided 
RTOOE and conventional test scores on the following four types of 
tests: (1) language aptitude, (2) mathematics aptitude, (3) language
achievement, and (4) mathematics achievement. Each of the
four tests provided a measure of RTOOE on nonsense items, R’ 
(Slakter, 1967a), a measure of RTOOE on legitimate items, Rs 
(Ziller, 1957a), and a legitimate score, L. 

The administration time for SEIT and SEAT was 45 minutes 
each, for a total of 90 minutes. With an additional 10 minutes for 
passing out the tests, etc., the total testing time was approximately 1 
hour and 40 minutes. In all cases except one, the vast majority of 
Ss had ample time to finish the tests. In one instance, the testing
time had to be shortened by approximately five minutes, and the
difference, if any, will be noted in the discussion section.


Results and Discussion 


Table 1 presents the means and sample sizes for the six schools on
the risk and aptitude or achievement measures. The aptitude or
achievement scores (L) were calculated as the number right minus
one-third the number wrong. Males, females, and totals are
displayed separately. The English-speaking Canadian school is
labeled S4; the bilingual Canadian school is labeled S;. School So
designates the New York State school that was stopped approximately
five minutes early, and thus had less time to finish than the other
schools.
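
The scoring rule for L described above (number right minus one-third
the number wrong) can be sketched as follows. The function name and
the example counts are illustrative assumptions, not taken from the
study; omitted items are assumed to contribute nothing to the score.

```python
def formula_score(n_right: int, n_wrong: int) -> float:
    """Number right minus one-third the number wrong.

    Omitted items are neither rewarded nor penalized (an assumption
    consistent with the usual correction for guessing).
    """
    return n_right - n_wrong / 3


# Hypothetical examinee on a 30-item test: 24 right, 6 wrong, 0 omitted.
print(formula_score(24, 6))  # 22.0
```

Under this rule an examinee who omits an item loses nothing, whereas a
wrong answer costs one-third of a point, which is why a reluctance to
guess can matter for the final score.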

The distributions of R’ and R, were examined for each of the 
four tests in each of the six schools. In each case, the distribution of 
the RTOOE measure was negatively skewed, with the mode at the 
extreme right. 

The correlations among the risk and aptitude or achievement 
measures for the six schools are provided in Table 2. The diagonal 
entries in the table are the split-half reliabilities (odd versus even, 
adjusted by the Spearman-Brown formula). From a visual compari- 
son of the correlation matrices for the males and females, it was ap- 
parent that the sex differences were negligible. Therefore, Table 2 
contains only the results for the total group. 
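
As a minimal sketch of the reliability computation named above, an
odd-even split-half correlation can be stepped up with the
Spearman-Brown formula, r_full = 2r / (1 + r). The function names and
the layout of the score matrix (rows are examinees, columns are items
in test order) are assumptions made for illustration.

```python
import numpy as np


def spearman_brown(r_half: float) -> float:
    """Step up a half-test correlation to full-test length."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(item_scores) -> float:
    """Odd-versus-even split-half reliability, Spearman-Brown adjusted."""
    x = np.asarray(item_scores, dtype=float)
    odd_total = x[:, 0::2].sum(axis=1)    # items 1, 3, 5, ...
    even_total = x[:, 1::2].sum(axis=1)   # items 2, 4, 6, ...
    r_half = np.corrcoef(odd_total, even_total)[0, 1]
    return spearman_brown(r_half)


print(round(spearman_brown(0.5), 3))  # 0.667
```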

The finding that sex differences in the correlation matrices were
negligible is consistent with the results found by Swineford (1941).
More importantly, however, an examination of Table 2 revealed
impressive evidence, across all six schools, for convergent and
discriminant validity as described by Campbell and Fiske (1959). A
principal components analysis with varimax rotation was performed on
the correlations for each school. In each analysis, two clear factors
emerged: one risk factor with heavy loadings on the eight RTOOE
measures, and one aptitude-achievement dimension with heavy
loadings on the four conventional test scores. In the six schools, the
proportion of variance explained by the risk factor ranged from 44
per cent to 48 per cent; for the aptitude-achievement factor, the
range was from 24 per cent to 26 per cent. It appears, therefore, that
R’ and R, measured essentially the same trait throughout the four
tests, and that the RTOOE trait was distinguishable from the
dimension corresponding to the aptitude-achievement scores.
Furthermore, while there was some RTOOE behavior specific to each
particular test, there was a large general factor that appeared in all
four tests. The finding of this general factor is consistent with
Swineford’s results (1941).
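
For readers who wish to see what such an analysis involves, the
following is a minimal sketch of principal-component loadings followed
by a varimax rotation, using a widely circulated SVD-based iteration.
It is not the program used in the study; the function names and the
small correlation matrix are made up for illustration.

```python
import numpy as np


def principal_component_loadings(corr, n_factors):
    """Loadings of the first n_factors principal components of corr."""
    eigvals, eigvecs = np.linalg.eigh(corr)            # ascending order
    order = np.argsort(eigvals)[::-1][:n_factors]      # largest first
    return eigvecs[:, order] * np.sqrt(eigvals[order])


def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation; returns (rotated loadings, rotation)."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        col_ss = (rotated ** 2).sum(axis=0)            # column sums of squares
        gradient = loadings.T @ (rotated ** 3 - rotated @ np.diag(col_ss) / p)
        u, s, vt = np.linalg.svd(gradient)
        rotation = u @ vt                              # nearest orthogonal matrix
        new_criterion = s.sum()
        if new_criterion < criterion * (1 + tol):      # no further improvement
            break
        criterion = new_criterion
    return loadings @ rotation, rotation


# Made-up correlations: variables 1-3 form one cluster, 4-5 another.
corr = np.array([
    [1.0, 0.6, 0.5, 0.1, 0.2],
    [0.6, 1.0, 0.4, 0.2, 0.1],
    [0.5, 0.4, 1.0, 0.1, 0.1],
    [0.1, 0.2, 0.1, 1.0, 0.5],
    [0.2, 0.1, 0.1, 0.5, 1.0],
])
unrotated = principal_component_loadings(corr, 2)
rotated, rotation = varimax(unrotated)
# Proportion of total variance attributed to each rotated factor.
print((rotated ** 2).sum(axis=0) / corr.shape[0])
```

Because the rotation is orthogonal, each variable's communality and the
total variance explained by the retained factors are unchanged; only
the division of that variance between factors shifts.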

It is important to note that the aptitude-achievement distinction
may not have been more than nominal. The general directions at
the beginning of each test booklet attempted to distinguish between
the aptitude (. . . A TEST WHICH WILL GIVE YOU A CHANCE
TO SHOW WHAT YOU KNOW AND HOW WELL YOU
THINK) and the achievement examinations (. . . A TEST OF
SOME OF THE KNOWLEDGE YOU HAVE GAINED DURING
YOUR SCHOOL YEARS). However, the distinction between the
two types of tests is not always obvious. In Table 1, it is seen that in
every school, the aptitude examination in mathematics or language
was more difficult than the corresponding achievement examination.
(Recall that there were 40 legitimate items on the language
aptitude test, and 30 legitimate items on each of the other tests.)
Therefore, the attempt to have less familiar items on the aptitude
tests appears to have been successful, and it is clear that the
aptitude-achievement dimension is confounded with difficulty. Indeed,
some may argue that the aptitude and achievement tests differ in
difficulty only, and not in any aptitude-achievement classification.
This writer would not take issue with such people. There is no doubt,
however, that the four tests did differ with respect to content
(language versus mathematics) and in difficulty, and that there was a
large general RTOOE factor throughout these tests.

Also of some interest was the tendency, across all six schools, for
the legitimate measures to correlate negatively with the RTOOE
measures. In other words, high propensities in RTOOE tended to be
associated with low legitimate scores; low propensities in RTOOE
tended to be associated with high legitimate scores. These negative
correlations were not, in general, of great magnitude. However,
previous studies of the relation between RTOOE and legitimate
measures have found either no correlation (Swineford, 1941) or
slightly positive correlations (Slakter, 1967b). A possible explanation
for this discrepancy is the lack of testwiseness at the 8th
grade level, and the resultant inability to profit by taking risks.
Another possible explanation is that some of the students, knowing
that the test scores would not enter into grades, etc., answered all of
the questions without attempting to select the correct answer. The
latter explanation was investigated by examining several of the
scatter diagrams after the elimination of all Ss who were assigned
RTOOE scores of 1 (i.e., high risk takers). Since the correlations
still appeared negative, the conjecture that the negativeness was due
to disinterested examinees was considered less plausible. The
conjecture concerning the lack of testwiseness at the 8th grade level
will be investigated in future research.

In another interesting side issue, a schools by sex multivariate
factorial analysis of variance was performed on the R’ vector; i.e., the
set of dependent RTOOE measures was composed of R’ on language
aptitude, R’ on mathematics aptitude, R’ on language achievement,
and R’ on mathematics achievement. Since the cell frequencies were
unequal, an exact least squares analysis was utilized (Bock, 1963).
The computer program used for the calculations was prepared by
Finn (1967). Using the .05 level, it was found that sex differences
(with the effects of schools eliminated) were significant, school
differences (with the effects of sex eliminated) were significant, but
that the interaction (with both main effects eliminated) was not
significant. It is important to note that Swineford (1941) found
males to be higher on RTOOE than females, while the present data
indicate just the opposite; i.e., females were higher on RTOOE than
males. This discrepancy in results may be explained by the fact that
Swineford’s Ss were in 9th grade, while the Ss of the present study
were just entering 8th grade. Since very little is known of the
relation between RTOOE and age, it is possible that the difference in
findings is merely a function of the age of the Ss. On the other hand,
it should be remembered that in the Swineford study (1941, p. 439),
approximately 25 per cent of the Ss were eliminated. Some of these
Ss were eliminated because they neglected to ask for extra credit;
i.e., were low on RTOOE. Since twice as many males were eliminated
as females, there is reason to suspect that the mean RTOOE
score for males was spuriously high.

From an inspection of Table 1, it is seen that the English-speaking
Canadian school (S,) was the lowest in RTOOE, with the bilingual
Canadian school (34) also somewhat low in RTOOE. One
might speculate that this result was due to differences in test sophis- 
tication, since the Canadian schools have been less exposed to objec- 
tive examinations than their United States counterparts. In order to 
determine if the sex or school differences in RTOOE might be ac- 
counted for by the differences in legitimate score, a schools by sex 
multivariate factorial analysis of covariance was performed on the 
R' vector, with the legitimate scores on the four tests used as the co- 
variates. As in the multivariate analysis of variance, the cell fre- 
quencies were unequal, and an exact least squares analysis was 
used. Also, as in the multivariate analysis of variance, significant 
differences were found for sex (with schools eliminated), for schools 
(with sex eliminated), but not for interaction (with both main
effects eliminated). With the use of covariance analysis, however, the
bilingual Canadian school appeared to be almost as low in RTOOE
as the English-speaking Canadian school.


utes (Se), was relatively unaffected by the decrease in time. 

In brief, the major question of interest dealt with the generality
of RTOOE across different types of tests. For various kinds of
schools, it was found that a large, general RTOOE factor was present
in the tests. In other words, Ss who tended to be high (low) in
RTOOE on a mathematics examination tended to be high (low) in
RTOOE on a language examination; Ss who tended to be high (low)
in RTOOE on aptitude (or difficult) examinations tended to be
high (low) in RTOOE on achievement (or less difficult) examinations.

Therefore, there is some evidence that the specific type of testing
situation was not too important in the measurement of RTOOE. If
such is the case, the particular type of testing situation used for the
measurement of RTOOE might well be decided on the basis of
convenience. For example, the language tests were composed largely of
vocabulary items. This type of test is comparatively easy to
construct [...] for psychologists interested in a disguised measure of
RTOOE, it seems reasonable to use these vocabulary type tests to
measure RTOOE, rather than (say) an appropriate mathematics test,
which might take considerably more time to construct and more time to
administer.

With respect to educational measurement, it has been previously
pointed out that several studies have demonstrated that when Ss
who are low in RTOOE are forced to respond to all test items, their
average test score will increase even though the usual penalty for
guessing is applied; i.e., it appears that Ss low in RTOOE tend to be
penalized in test score. The present study would suggest that Ss low
in RTOOE in one area (e.g., language achievement) would also
tend to be low in RTOOE in other areas (e.g., mathematics
achievement), and hence would tend to be penalized not just in one
area, but in many areas. Test constructors, as well as individuals
low in RTOOE, should consider the resulting implications.


The motivation for programmed tests has been given in previous
papers (Cleary, Linn, and Rock, 1968a, 1968b). A programmed test
has been defined as a test in which the presentation of test items is
contingent upon the previous responses of the examinee. Although
not well suited to the present paper-and-pencil technology of the
testing industry, programmed tests are being made more feasible by
improved use of computer hardware. The flexibility and the
possibilities of reduced testing time are the major appeals of
programmed tests.

A dissertation by Patterson (1962) represents one of the attempts 
at investigating the advantages of a programmed test. Patterson 
used probability models and hypothetical populations for his
analyses. In comparing the sequential branching tests with conventional
cumulative test results, he found that the sequential test
discriminated better at the extremes than did the conventional test.

Several members of the U. S. Army Behavioral Science Research
Laboratory have followed the general approach taken by Patterson.
Particularly relevant are a Technical Report by Waters (1964) and
a Technical Report by Bayroff and Seeley (1967). As in Patterson’s
work, Waters used a probability model and a hypothetical
population to simulate a branching test situation. In her work,
Waters assumed a normal distribution of underlying ability and
uniform tetrachoric item intercorrelations of .64. She found that the
branching test had a higher correlation with underlying ability than
any of the conventional tests which she investigated.

Bayroff and Seeley (1967) administered a verbal and an arithmetic
reasoning branching test to 102 subjects. The branching tests
were presented on a teletype machine and subjects responded by
pressing an appropriate typewriter key. The branching tests
consisted of either eight or nine items depending on the particular
branch followed by an examinee. Branching was based on item
difficulty. They also administered a conventional 50-item verbal test
and 40-item arithmetic reasoning test. The correlations between the
branching tests and their conventional counterparts were .78 and
.74 for verbal and arithmetic reasoning respectively. It was
estimated that a conventional verbal test of 16 items and an arithmetic
reasoning test of 19 items would be required to achieve the results
observed for the branching tests. The above results are based on a
sample of only 102 subjects and so must be considered somewhat
tentative.

An investigation of a two-level test was made by Angoff and Hud- 
dleston (1958). They found that the two-stage test was technically 
superior to a single test, but the margin of superiority was not suffi- 
cient to justify the adoption of the procedure in view of the admin- 
istrative problems that would be encountered. However, their focus 
was upon achieving an increase in reliability and validity rather 
than on reducing testing time while maintaining a given level of re- 
liability and validity. 

Cleary et al. (1968a) used existing item data to investigate four
methods of constructing programmed tests comprised of two major 
levels. The item pool consisted of 190 verbal items for which re- 
sponses were available for almost 5,000 subjects. The correlations of 
the programmed tests which used approximately 40 items were quite 
close to the reliability of the 190-item test. However, conventional 
tests of the same length consisting of the items with the highest 
point-biserial correlations with the total test score yielded part- 
whole correlations which were almost as high as the corresponding 
correlations for the programmed tests. 

The most promising of the four basic strategies in the above study
(Cleary et al., 1968a) was modified and the data were reanalyzed
(Cleary et al., 1968b). The modified procedure (called the
three-group sequential method) required an average of 37.5 items per
examinee and had a part-whole correlation with the total 190-item test
in the cross-validation sample that was slightly higher than that of
the 50 items with the highest point-biserial correlations with the
190-item test score.

The results of the previous studies by Cleary et al. (1968a, 1968b)
were limited to considerations of a single internal criterion: the
190-item test score. The purpose of the present paper is to evaluate
the five methods previously investigated along with two additional
methods against four external criteria.


Procedure


Item response data were obtained for a total of 4,885 eleventh-grade
students on the 190 verbal-type items of the SCAT and STEP
tests. These items test verbal aptitude, reading achievement, and
writing achievement, but were treated as belonging to a single test
from which the programmed and conventional tests were constructed.

The 4,885 students were randomly divided into two samples. The
first sample was used to select items for all tests and to calculate
scoring and scaling weights. The first sample is referred to as the
original sample throughout this paper. The second sample was
reserved for cross-validation.

In January and February of 1967, a year and a half after the
administration of the SCAT and STEP tests, scores were obtained for
approximately two-thirds of the original students on the College
Board Achievement Tests in American History and English Composition
and the Verbal and Mathematics tests of the Preliminary
Scholastic Aptitude Test (PSAT). The third of the original sample
for which the second set of scores was not available contains
students who had left school, changed schools in the interim period, had
unmatchable identifications, etc.


Programmed Tests


Seven different programmed tests were developed. For each test,
the appropriate items in the pool were scored as if the students
had taken only those items in the order dictated by the given
programmed test strategy. The students responded to the test items in
standard paper-and-pencil format, but the programmed tests were
scored as if the student had responded to the items in a programmed
test format. Thus, the results are for simulated rather than actual
programmed tests.

Five of the programmed test methods consisted of two major sec- 
tions: (a) a routing section and (b) a measurement section. Approx- 
imately the same number of items were used for each of these two 
major sections in all five cases. Methods one to four have previously 
been reported in some detail (Cleary et al., 1968a) and the fifth 
method which involves a modification of the fourth has also been 
specified (Cleary et al., 1968b). The five routing methods were:


1. Two-stage routing test: Ten items which are answered correctly
by approximately 50 per cent of the original sample
were selected. Scores on these 10 items were then used to divide
the sample into two groups of approximately equal size.
The above process was repeated for each of the two subgroups,
yielding a total of four groups. This routing procedure used a
total of 30 items, 20 of which were used to route any given
individual.

2. Broad range routing test: The proportion of the original sample
that chose the correct alternative for each of the 190 items
was computed. Twenty items that had an approximately
rectangular distribution of item difficulties over the observed
range of item difficulties (.15 to .91) were then selected. The
total number of correct responses on these 20 items was then
used to divide the sample into four groups of approximately
equal size.

3. Group-discrimination routing test: The total number of correct
responses on the 190 items for each examinee was used to
form four groups of examinees of approximately equal size
in the original sample, and the proportion of subjects choosing
the correct alternative to each item was computed within
each of the four groups. The 20 items that had the largest
ranges in the mean item scores from the low to the high group
were then selected. (All items had strictly increasing means.)
The total number correct for each examinee on the selected
20 items was then used to form four groups of examinees of
approximately equal size.




ROBERT L. LINN, ET AL. 133 


4. Four-group sequential item sampling: The 23 items with the
highest point-biserial correlations with total test scores for the
original sample were selected and arranged in decreasing order
of the point-biserial correlations. (Twenty-three items were
chosen in an attempt to make the average number of items
scored for each person approximately 20.) On the basis of total
test score, the sample was divided into four groups of
approximately equal size and the average proportion of correct
choices across the above 23 items, $P_g$, was computed for each
group, g. For any given subject, the 23 items were then scored
one at a time until the subject was assigned to a group with a
fixed predetermined risk of misclassification or until all 23
items had been used (see Cleary et al., 1968a for a detailed
description). If all 23 items were used, the subject was assigned
to that group, g, with the $P_g$ that was nearest to his
proportion of correct responses on the 23 items. The procedure
used to route individuals into groups falls within the framework
of the sequential sampling procedures developed by
Wald (1950) and by the Statistical Research Group (1945);
the specific procedure used was developed by Armitage
(1950).

5. Three-group sequential item sampling: This method is a mod- 
ification of the four-group sequential and differs from it in 
two respects: the routing section was used to create three 
rather than four groups, and the maximum number of items 
scored in the routing section was 20 rather than 23. 
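
The broad range routing test (method 2 above) can be sketched in a few
lines. This is only an illustration under stated assumptions: the
greedy item picker, the function names, and the simulated response
matrix are all hypothetical and are not the procedure actually used to
choose the 20 items.

```python
import numpy as np


def item_difficulty(responses):
    """Proportion correct per item (rows = examinees, columns = items)."""
    return responses.mean(axis=0)


def pick_broad_range(difficulty, n_items=20):
    """Greedily pick items whose difficulties are roughly evenly spaced
    over the observed range (an approximately rectangular distribution)."""
    targets = np.linspace(difficulty.min(), difficulty.max(), n_items)
    available = list(range(len(difficulty)))
    chosen = []
    for t in targets:
        j = min(available, key=lambda i: abs(difficulty[i] - t))
        chosen.append(j)
        available.remove(j)
    return chosen


def equal_size_groups(scores, n_groups=4):
    """Label examinees 0..n_groups-1 by score rank; ties are broken by
    position so that group sizes stay approximately equal."""
    ranks = np.argsort(np.argsort(scores, kind="stable"), kind="stable")
    return ranks * n_groups // len(scores)


# Simulated 0/1 responses for 400 examinees on 60 items (hypothetical data).
rng = np.random.default_rng(0)
ability = rng.normal(size=400)
location = np.linspace(-2.0, 2.0, 60)
p_correct = 1 / (1 + np.exp(-(ability[:, None] - location[None, :])))
responses = (rng.random((400, 60)) < p_correct).astype(int)

routing_items = pick_broad_range(item_difficulty(responses), n_items=20)
routing_score = responses[:, routing_items].sum(axis=1)
group = equal_size_groups(routing_score)
print(np.bincount(group))  # four groups of 100 examinees each
```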


To develop the measurement tests, within-group point-biserial
correlations between items and total test score were computed for
each of the three or four ability groups of the original sample
(groups formed on the basis of total test score). For each of the
groups, the 20 items with the highest within-group point-biserial
correlations (excluding the 71 items that had been used for any of
the routing tests) were selected from the remaining 119 items for the
measurement test for that group.

For each of the five experimental test procedures, two scoring
procedures were investigated: one based on within-group coefficients;
the other, on pooled coefficients. If $Z_{pg}$ is the score for person p
who was assigned to group g by a given routing test, then the two
scoring formulas were:

$$Z_{pg} = a_g + b_g X_{pg}, \tag{1}$$

$$Z_{pg} = a_g + b X_{pg}, \tag{2}$$

where $a_g$ and $b_g$ are constants for group g, $X_{pg}$ is the number
correct for person p on measurement test g, and $b$ is a constant. The
total number correct on the entire pool of 190 items was used as the
criterion to develop the constants for equations (1) and (2) in the
original sample.

The usual least-squares linear regression coefficients in the original
sample were used as the $a_g$'s and $b_g$'s. A weighted average of the
$b_g$'s was used to compute $b$:

$$b = \frac{1}{N} \sum_g n_g b_g, \tag{3}$$

where $n_g$ is the number of persons in group g and $N = \sum_g n_g$ is
the total number of persons.
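
The two scoring rules can be sketched as follows: a least-squares line
is fitted within each routing group, and the pooled slope is the
group-size-weighted average of the within-group slopes. The function
names and the toy data are assumptions for illustration; in particular,
using the within-group intercept together with the pooled slope is one
reading of equation (2).

```python
import numpy as np


def fit_scoring_constants(x, criterion, group):
    """Within-group intercepts a[g] and slopes b[g], plus the pooled slope.

    x         : number correct on the measurement test
    criterion : total number correct on the full item pool
    group     : routing group label for each person
    """
    a, b, n = {}, {}, {}
    for g in np.unique(group).tolist():
        mask = group == g
        slope, intercept = np.polyfit(x[mask], criterion[mask], 1)
        a[g], b[g], n[g] = float(intercept), float(slope), int(mask.sum())
    pooled_b = sum(n[g] * b[g] for g in a) / sum(n.values())
    return a, b, pooled_b


def programmed_score(x_p, g, a, b, pooled_b=None):
    """Within-group slope if pooled_b is None, otherwise the pooled slope."""
    slope = b[g] if pooled_b is None else pooled_b
    return a[g] + slope * x_p


# Toy data: group 0 follows y = 10 + 2x, group 1 follows y = 50 + 4x.
x = np.array([0, 1, 2, 3, 0, 1, 2], dtype=float)
y = np.array([10, 12, 14, 16, 50, 54, 58], dtype=float)
group = np.array([0, 0, 0, 0, 1, 1, 1])

a, b, pooled_b = fit_scoring_constants(x, y, group)
print(round(pooled_b, 3))                                  # (4*2 + 3*4) / 7 = 2.857
print(round(programmed_score(2, 1, a, b), 3))              # within-group: 58.0
print(round(programmed_score(2, 1, a, b, pooled_b), 3))    # pooled: 55.714
```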

In the initial study involving the four-group programmed test
methods (Cleary et al., 1968a) the scores defined in equations (1)
and (2) were used as the programmed test scores. F. M. Lord
(personal communication) suggested that this is not making efficient
use of the information in the routing section of the programmed test
since no distinction is made between two examinees with different
scores who are assigned to the same measurement test in terms of
their scores on the routing test. In the present study the score on the
routing section was added to $Z_{pg}$ and the sum was used as the
programmed test score. This change in scoring procedure resulted in
very slight increases in the correlations with the 190-item total test
score (the increases ranged from .0007 to .0038).

The last two methods differed markedly in their basic approach 
from the first five methods. They were modeled on a complete 
branching tree design rather than a routing section followed by a
measurement section.

The sixth test was called the 10-item branching test. In this test
branching occurred after each item according to whether the item
response was correct or incorrect. Thus, the same item was scored
as the first item for all individuals, but the second item was a
somewhat more difficult item for those who answered the first item
correctly and the second item was a somewhat easier item for
examinees who answered incorrectly. This procedure continued until a
total of 10 items were scored for any given individual. This procedure
required a total of 55 items.

The actual selection of items for this branching test was made by
grouping items according to difficulty level corresponding to the 19
possible difficulty levels required by the branching tree design.
Those items with the highest point-biserial correlations with the
total score within a needed difficulty level were then selected for the
branching test. The point-biserials were based on the entire original
sample, not on the specific subgroup for whom the item was used.
The actual item difficulties are presented in Figure 1, which also
provides an illustration of the branching procedure. A correct response
to the first item was scored 3.5 and an incorrect response 0; the
scoring weight was then incremented by .2 for each succeeding item
on a correct branch and decremented by .2 for each succeeding item
on an incorrect branch. The 10 possible tenth items had weights
ranging from 1.7 for the easiest item to 5.3 for the hardest item; the
possible score range was 0 to 39.5.
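
The structure of the branching tree can be checked by enumerating its
nodes. The sketch below assumes that an item's position in the tree is
fixed by the stage and the number of correct answers so far, and that
its weight is 3.5 plus .2 times the net count of correct minus
incorrect responses to that point; the function name and tuple layout
are illustrative.

```python
def branching_tree(n_stages=10, first_weight=3.5, step=0.2):
    """List (stage, n_correct_so_far, level, weight) for every item slot.

    level is correct-minus-incorrect responses before the item is shown;
    items sharing a level are assumed to share a difficulty level.
    """
    nodes = []
    for stage in range(n_stages):              # stage 0 is the first item
        for n_correct in range(stage + 1):
            level = 2 * n_correct - stage
            weight = round(first_weight + step * level, 1)
            nodes.append((stage, n_correct, level, weight))
    return nodes


nodes = branching_tree()
print(len(nodes))                                   # 55 items in all
print(len({level for _, _, level, _ in nodes}))     # 19 difficulty levels
tenth = sorted(w for stage, _, _, w in nodes if stage == 9)
print(tenth[0], tenth[-1])                          # 1.7 5.3
```

Under these assumptions the enumeration reproduces the 55 items, the 19
difficulty levels, and the range of weights for the tenth item given in
the text.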

Figure 1. Item difficulties for the 10-item branching test.

The last programmed test considered in this paper was the
item-blocks branching test. This second branching test was quite similar
in design except that a total of 25 items were scored for each
individual and branching occurred after each block of five items instead
of after each item. Thus the same block of five items was scored as
the first five items for all examinees; then those examinees who
answered three, four, or five of the first block of items correct had a
block of five somewhat more difficult items scored as items 6 through
10, and examinees who answered zero, one, or two of the first block
correct had a block of five somewhat easier items scored as items 6
through 10. In all, this procedure required 75 items, 25 of which
were scored for any given examinee. Items were selected for this test
in essentially the same way as they were in the 10-item branching
test; that is, items were grouped according to difficulty level for the
various levels of the blocks of items and then items with the highest
point-biserial correlations with the total test score were selected for
each level.

The range of item difficulties and an illustration of the branching
procedure are presented in Figure 2. Correct responses to items in the
first block were scored 3.5; the scoring weight was then either
increased or decreased by .5 for the second block of five items. The
weights for the five possible final blocks of items were 1.5, 2.5, 3.5,
4.5, and 5.5; the possible score range was 0 to 112.5.
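
A short sketch of the item-blocks scoring may make the arithmetic
concrete. It assumes that the branching rule described for the first
block (three or more correct leads to a harder block) and the .5 change
in weight apply after every block; the function names are hypothetical.

```python
def item_blocks_score(correct_per_block, first_weight=3.5, step=0.5):
    """Score a sequence of five blocks given the number correct in each."""
    weight, score = first_weight, 0.0
    for n_correct in correct_per_block:
        score += weight * n_correct
        # Three, four, or five correct: branch to a harder, heavier block.
        weight += step if n_correct >= 3 else -step
    return score


def items_required(n_blocks=5, block_size=5):
    """Stage k of the tree needs k distinct blocks (k = 1, ..., n_blocks)."""
    return block_size * sum(range(1, n_blocks + 1))


print(items_required())                      # 75
print(item_blocks_score([5, 5, 5, 5, 5]))    # 112.5
print(item_blocks_score([0, 0, 0, 0, 0]))    # 0.0
```

The maximum of 112.5 is 5 x (3.5 + 4.0 + 4.5 + 5.0 + 5.5), which agrees
with the score range and the final-block weights stated above.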


Figure 2. Item difficulties for 25-item blocks-of-5 branching test.


Conventional Tests


In addition to the seven programmed tests, five shortened
conventional tests were scored for comparative purposes. These five
short tests were overlapping and consisted of the 10, 20, 30, 40, or 50
items with the highest point-biserial correlations with the 190-item
test score.


Evaluation with External Criteria


The four tests that were administered in 1967 were used as the
outside criterion variables against which the programmed tests were
evaluated and compared with the shortened conventional tests and
the 190-item total test. All students who had a given criterion test
score (i.e., History, English, PSAT-V, or PSAT-M) were used to
compute correlations of that criterion test with the programmed
test scores, the shortened conventional test scores, and the 190-item
total test score. Correlations were computed separately for the
original and cross-validation samples.

The validities obtained for each of the programmed test scores
and the 190-item score were used to obtain estimates of the length
that a test which was parallel to the 190-item test would need to be
to have a validity equal to that of the programmed test under
consideration. The equation used to estimate the length, L, is:

$$L = \frac{190\, r_{pc}^{2}\,(1 - r_{tt})}{r_{tc}^{2} - r_{tt}\, r_{pc}^{2}}, \tag{4}$$

where $r_{tt} = .97$ is the split-half reliability of the 190-item test,
$r_{tc}$ is the correlation of the 190-item test with the criterion
test, and $r_{pc}$ is the correlation between the programmed test and
the criterion test.

For a given programmed test eight separate estimates of L were
obtained, two for each of the four criterion tests, by considering the
original and cross-validation samples separately. The median of the
eight estimates was then determined and compared to the actual
number of items scored for a student on a programmed test.
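
The length estimate can be sketched numerically. The formula coded
below follows from the standard relation between validity and test
length: a test k times as long as the 190-item test has validity
r_tc * sqrt(k / (1 + (k - 1) * r_tt)); setting this equal to r_pc and
solving for k gives L = 190k. The function name and the illustrative
validities are assumptions, not values from the study.

```python
import math
import statistics


def equivalent_length(r_pc, r_tc, r_tt=0.97, n_items=190):
    """Length of a test parallel to the full test whose validity is r_pc.

    r_pc : validity of the programmed test against the criterion
    r_tc : validity of the full-length test against the same criterion
    r_tt : reliability of the full-length test
    """
    return n_items * r_pc ** 2 * (1 - r_tt) / (r_tc ** 2 - r_tt * r_pc ** 2)


# Round trip: shorten the 190-item test to a quarter of its length,
# compute the implied validity, and recover the length from it.
r_tc, r_tt, k = 0.60, 0.97, 0.25
r_short = r_tc * math.sqrt(k / (1 + (k - 1) * r_tt))
print(round(equivalent_length(r_short, r_tc, r_tt), 1))   # 47.5

# The median over several hypothetical estimates would be taken like this.
print(statistics.median([41.0, 44.5, 39.0, 52.0]))        # 42.75
```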


Results 


Since different criteria were used for item selection for the various
programmed and conventional tests, the distributions of item content
were compared. The numbers of verbal aptitude, reading
achievement, and writing achievement items in each test are presented
in Table 1.

TABLE 1
Distribution of Test Items by Item Content

                                       Item Content
Programmed Tests              Verbal   Writing   Reading   Total
Two Stage                       37       22        28        87
Broad Range                     34       15        28        77
Group Discrimination            37       15        25        77
4 Group Sequential              31       15        34        80
3 Group Sequential              41       12        24        77
10 Item Branching               23       10        22        55
25 Item Block 5 Branching       21       24        31        76
Conventional Tests
10 Item                          3        1         6        10
20 Item                          6        2        12        20
30 Item                          7        6        17        30
40 Item                          9       10        21        40
50 Item                         12       15        23        50
Total Test                      60       60        70       190


In Table 2, the correlations of the various programmed test
scores and the shortened conventional tests with the 190-item test
are reported. Results are reported for both of the scoring methods
for the first five programmed tests. Thus the Two Stage (within)
and the Two Stage (pooled) refer to the two stage routing test
method scored by equations (1) and (2) respectively. Separate
results are given for the original and cross-validation samples and
within each of these for all cases initially available and for only
those cases for whom one or more of the criterion test scores were
available.

In general, the correlations based on all cases do not differ
markedly from those based on cases with criterion test scores: only
three of the pairs differ by as much as .01. Minor
changes in the rank order of the correlations do occur between the
two columns within a sample.

Among the programmed tests the highest correlations are ob-
tained for the two sequential test methods, which are of about the
same value as the 50-item conventional test, although the sequential
tests required an average of 41.1 and 37.5 items respectively for the
four-group and three-group methods. Both the group discrimination
test and the 25-item-block-5 branching test have higher correlations
TABLE 2
Correlations of Programmed and Shortened Conventional
Tests with 190-Item Total Test Score

                                 Original Sample          Cross-Validation Sample
                                           Cases with                 Cases with
                              All          Criterion      All         Criterion
                              Cases        Test Data      Cases       Test Data
Programmed Tests              (N = 2478)   (N = 1730)     (N = 2407)  (N = 1644)
Two Stage (within)              .9360        .9349          .9308       .9287
Two Stage (pooled)              .9274        .9248          .9225       .9204
Broad Range (within)            .9474        .9443          .9391       .9415
Broad Range (pooled)            .9461        .9422          .9379       .9408
Group Discrim. (within)         .9533        .9540          .9468       .9520
Group Discrim. (pooled)         .9520        .9523          .9452       .9510
4 Gp. Sequential (within)       .9663        .9636          .9602       .9613
4 Gp. Sequential (pooled)       .9632        .9590          .9569       .9568
3 Gp. Sequential (within)       .9620        .9572          .9610       .9559
3 Gp. Sequential (pooled)       .9533        .9556          .9593       .9539
10 Item Branching               .8738        .8608          .8753       .8651
25 Item Block 5 Branching       .9506        .9440          .9485       .9410
Conventional
10 Item                         .8873        .8748          .8825       .8784
20 Item                         .9313        .9253          .9237       .9210
30 Item                         .9426        .9408          .9392       .9402
40 Item                         .9538        .9513          .9508       .9505
50 Item                         .9615        .9580          .9596       .9577


than the 30-item conventional test; the group-discrimination test
required 40 items, however.

The correlations of the four-group programmed test methods with
the criterion test variables are presented in Table 3. Although the
four-group sequential method had the highest correlations with the
190-item test, the other four-group methods have some higher cor-
relations with the outside criteria, particularly the History and Math
tests. The group-discrimination method had the highest median cor-
relations for the two samples.

The shortened conventional tests and the 190-item test correla-
tions with the criterion tests are listed in Table 4. Only for the Eng-
lish Achievement and PSAT-V correlations is there a monotonic in-
crease in the magnitude of the correlations as the number of items is
increased. With the exception of the correlations for the two-stage
test with the PSAT-V in the cross-validation sample, all of the four-
group programmed test correlations are higher than the correspond-
ing correlations for even the 50-item conventional test.
TABLE 3
Four-Group Programmed Tests: Correlations with Outside Criteria

                                  Achievement                PSAT
Method                        History     English      Verbal      Math
Original Sample
                             (N = 1700)  (N = 1649)  (N = 1477)  (N = 1460)
Two Stage within               .6693       .7464       .7860       .6339
Two Stage pooled               .6774       .7509       .7863       .6386
Broad Range within             .6739       .7445       .7874       .6364
Broad Range pooled             .6724       .7456       .7871       .6368
Group Discrim. within          .6772       .7541       .8073       .6481
Group Discrim. pooled          .6696       .7407       .8018       .6421
Sequential within              .6495       .7518       .8000       .6302
Sequential pooled              .6341       .7307       .7874       .6175

Cross-Validation Sample
                             (N = 1622)  (N = 1584)  (N = 1415)  (N = 1405)
Two Stage within               .6777       .7456       .7659       .6231
Two Stage pooled               .6841       .7527       .7102       .6305
Broad Range within             .6938       .7522       .7907       .6356
Broad Range pooled             .6935       .7528       .7904       .6358
Group Discrim. within          .7053       .7655       .8070       .6501
Group Discrim. pooled          .6992       .7598       .8033       .6520
Sequential within              .6755       .7646       .7922       .6277
Sequential pooled              .6025       .7507       .7816       .6194


With five exceptions, the correlations of the 190-item test are 
higher than the corresponding correlations for the programmed 
tests. A comparison of the group-discrimination test using within- 
group scoring weights and the 190-item test shows that the differ- 
ences range from +.0047 for the PSAT-M in the cross-validation
sample to -.0318 for English Achievement in the cross-validation
sample. The median difference is only -.0067. A similar comparison
of the 50-item conventional test with the 190-item test yields dif-
ferences ranging between -.0458 and -.0687 with a median differ-
ence of -.0581.

The correlations of the three-group sequential test with the cri-
terion tests are presented in Table 5. This method requires an aver-
age of 37.5 items whereas the four-group methods require 40 items
for each of the first three methods and 41.1 for the sequential 
method. The correlations of the three-group sequential test using 
pooled scoring weights are higher than those of the corresponding 
correlations of the 50-item conventional tests and compare favor- 
TABLE 4
Validity Data: Conventional Tests Consisting of Items with the Highest
Point-Biserial Correlations with Total Test Score

                      Achievement              PSAT
Test               History    English     Verbal     Math
Original Sample
10 Item                                    .7164     .5518
20 Item                                    .7665
30 Item                                    .7685     .5934
40 Item                                    .7687     .5910
50 Item                                    .7737     .5949
190 Item            .6765      .7793       .8225     .6476

Cross-Validation Sample
10 Item             .5916      .6720       .7176     .5605
20 Item             .6268      .7206       .7599     .5909
30 Item             .6272      .7264       .7662     .5972
40 Item             .6275      .7342       .7690     .5934
50 Item             .6361      .7376       .7750     .5956
190 Item            .7047      .7973       .8208     .6514


ably with the four-group programmed test methods. The median dif-
ference between the three-group sequential test with pooled weights
and the 190-item test correlations is -.0088 with a
range of +.0044 to -.0221. This is one of the smallest median differ-
ences of any of the programmed tests considered in this paper.
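
The median and range quoted above can be rechecked from the tabled correlations. The following is a minimal sketch in Python, assuming the eight pooled-weight correlations of the three-group sequential test (Table 5) and the eight 190-item correlations (Table 4) are read as listed in the comments; the function and variable names are illustrative only and are not part of the original study.

```python
from statistics import median

# Correlations with History, English, PSAT-V, PSAT-M: original sample first,
# then cross-validation sample (values as read from Tables 4 and 5).
three_group_pooled = [0.6809, 0.7640, 0.8131, 0.6467,
                      0.7006, 0.7752, 0.8064, 0.6432]
total_190_item = [0.6765, 0.7793, 0.8225, 0.6476,
                  0.7047, 0.7973, 0.8208, 0.6514]


def validity_differences(short_test, full_test):
    """Short-test minus full-test correlations, rounded to four places."""
    return [round(s - f, 4) for s, f in zip(short_test, full_test)]


differences = validity_differences(three_group_pooled, total_190_item)
median_difference = round(median(differences), 4)

print("differences:", differences)
print("median:", median_difference)
print("range:", max(differences), "to", min(differences))
```

With these inputs the sketch reproduces the reported median of -.0088 and the range of +.0044 to -.0221.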

Table 6 contains the correlations between the branching tests and
the criterion tests. Although the 10-item branching test had lower
correlations with the 190-item test than did the 10-item conven-


TABLE 5
Three Group Sequential Test: Correlations with Outside Criteria

                      Achievement              PSAT
Method             History    English     Verbal     Math
Original Sample
Within              .6730      .7555       .8078     .6404
Pooled              .6809      .7640       .8131     .6467
Cross-Validation Sample
Within              .6944      .7671       .8020     .6388
Pooled              .7006      .7752       .8064     .6432
TABLE 6
Branching Tests: Correlations with Outside Criteria

                                  Achievement              PSAT
Test                           History    English     Verbal     Math
Original Sample
10 Item Branching               .5877      .6810       .7115     .5636
25 Item Block 5 Branching       .6447      .7320       .7847     .6101
Cross-Validation Sample
10 Item Branching               .6078      .7093       .7300     .5803
25 Item Block 5 Branching       .6799      .7595       .7882     .6281


tional test, seven of the eight correlations with the criterion tests are 
higher for the 10-item branching test than the corresponding corre- 
lations of the 10-item conventional test. 

The 25-item-block-5 branching test had correlations which were 
higher in all eight cases than the corresponding 50-item conven- 
tional test. Thus by scoring one-half as many items for each student 
higher correlations were obtained for the branching test than for the 
conventional test. The median difference between the 25-item-block-
5 branching test and the 190-item test was -.0322 with a range of
-.0233 to -.0473.

A general comparison of the various programmed test methods 
from a somewhat different frame of reference is presented in Table 
7. The eight correlations of each programmed test with the criterion 
tests were used to compute estimates of the length that would be re- 
quired for a test that was parallel to the 190-item test to have the 
same validity. Equation (4) was used for these calculations. An 
entry in the column headed L in Table 7 is the median number of 
items required to obtain the eight correlations with criterion tests 
for each programmed test according to equation (4). 

The second column of Table 7 lists the number of items (I) ac-
tually required for each of the programmed tests. The third column
is just the ratio of L to I. This ratio provides an indication of the
relative saving in test length which is achieved by each of the pro-
grammed test scores. A ratio of 1.0 would represent no saving, a ra-
tio greater than 1.0 indicates a saving and one less than 1.0 indicates
a loss. Only the four-group sequential method with pooled scoring
weights showed a loss (L/I = .95). The group-discrimination test
TABLE 7
Programmed Test Length (I) and Median Estimate of Test Length Required (L) for
Equal Validity Where Test Is Parallel to the 190-Item Test

Programmed Test                   L        I       L/I
Two Stage (within)               46.5     40       1.16
Two Stage (pooled)               49.6     40       1.24
Broad Range (within)             58.3     40       1.46
Broad Range (pooled)             58.4     40       1.46
Group Discrim. (within)         134.6     40       3.36
Group Discrim. (pooled)          93.2     40       2.33
4 Gp. Sequential (within)        51.7     41.1*    1.26
4 Gp. Sequential (pooled)        39.0     41.1*     .95
3 Gp. Sequential (within)        79.8     37.5*    2.13
3 Gp. Sequential (pooled)        99.0     37.5*    2.64
10 Item Branching                16.5     10       1.65
25 Item Block 5 Branching        43.9     25       1.76

* Mean number required.


with within-group weights and the three-group sequential method 
with pooled weights had the most favorable ratios (3.36 and 2.64 re- 
spectively). The two branching tests also had relatively high ratios: 
1.65 for the 10 item and 1.76 for the 25 item. 
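
The saving ratio is simple arithmetic on the L and I columns of Table 7, and a short sketch may make the saving/loss reading concrete. The Python below assumes the tabled values as listed in the comments; L for the three-group (within) row is taken as 79.8, the reading consistent with its printed ratio of 2.13. The names `saving_ratio` and `classify` are hypothetical helpers, not part of the original procedure, and equation (4), which produced the L estimates, is not reproduced here.

```python
# (test, L = median estimated length of a parallel test, I = items scored).
# I for the sequential tests is the mean number of items required.
table_7 = [
    ("Two Stage (within)", 46.5, 40.0),
    ("Two Stage (pooled)", 49.6, 40.0),
    ("Broad Range (within)", 58.3, 40.0),
    ("Broad Range (pooled)", 58.4, 40.0),
    ("Group Discrim. (within)", 134.6, 40.0),
    ("Group Discrim. (pooled)", 93.2, 40.0),
    ("4 Gp. Sequential (within)", 51.7, 41.1),
    ("4 Gp. Sequential (pooled)", 39.0, 41.1),
    ("3 Gp. Sequential (within)", 79.8, 37.5),
    ("3 Gp. Sequential (pooled)", 99.0, 37.5),
    ("10 Item Branching", 16.5, 10.0),
    ("25 Item Block 5 Branching", 43.9, 25.0),
]


def saving_ratio(required_length, items_scored):
    """Ratio L/I: parallel-test length needed per item actually scored."""
    return required_length / items_scored


def classify(ratio):
    """Above 1.0 is a saving, below 1.0 a loss, exactly 1.0 no saving."""
    if ratio > 1.0:
        return "saving"
    if ratio < 1.0:
        return "loss"
    return "no saving"


ratios = {name: saving_ratio(L, I) for name, L, I in table_7}
losses = [name for name, r in ratios.items() if classify(r) == "loss"]
best = max(ratios, key=ratios.get)

for name, r in ratios.items():
    print(f"{name:28s} {r:5.2f}  {classify(r)}")
print("loss:", losses, "| most favorable:", best)
```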
Discussion 
The results of this study suggest that substantial savings in test 
length might be achieved by using some of the programmed test 
procedures considered here. The validity of this conclusion obvi-
ously rests upon the assumption that responses to an item do not de-
pend upon the context in which that item is presented. That is, the
programmed test scores and the shortened conventional test scores
were computed as if the examinees had actually taken the items in
the implied order. Although studies of item rearrangement suggest
that, under essentially nonspeeded test conditions, the item statistics
and test correlations with other variables are not significantly af-
fected by the arrangement of the items (Mollenkopf, 1950;
Flaugher, Melton, and Myers, 1968), the approach used in the pres-
ent study is certainly no substitute for investigations involving ac-
tual programmed tests.
The choice of a programmed test procedure would be influenced
by the choice of the criterion. In terms of reproducing the 190-item
total test score, the three- and four-group sequential methods and
the 25-item-block-5 branching tests seem most promising, but none
of them was markedly superior to the shortened conventional tests.
In terms of the four outside criterion tests the generally best pro-
grammed tests were the group-discrimination, the three-group se- 
quential, the 25-item-block-5 branching test, and the 10-item 
branching test, all of which showed substantial gains when com- 
pared to the shortened conventional tests or to the test length re- 
quired for a test which is parallel to the 190-item test. 

The fact that the group-discrimination test showed so favorably 
in terms of the outside criterion tests could be of considerable prac- 
tical significance. It is the only one of the four best methods that 
could be readily implemented within existing paper-and-pencil test- 
ing technology since it requires only one branching decision. While 
it is conceivable that the other three methods could be implemented 
in some form of paper-and-pencil testing they would be very cum- 
bersome and are much better suited to the use of some sort of testing 
machine or remote console connected to a central processor. 

The work reported in this paper is still quite exploratory in na- 
ture. As previously noted the use of existing data is not a substitute 
for actual programmed test administrations. Besides this limita- 
tion, the procedures for item selection were not optimal. It would 
have been preferable to use estimates of item characteristic curve 
parameters in place of the item difficulties and point-biserial corre-
lations which were actually used. Of possibly even greater impor-
tance is the fact that the items were not designed for use in a branch-
ing test. In any event, the results seem sufficiently promising to en-
courage further research on programmed tests.


Summary 


The use of a programmed test in which a subject is directed to
items appropriate to his ability on the basis of previous responses
represents a promising approach to the reduction of testing time
while maintaining a given level of validity. Seven programmed test
procedures were investigated using existing data on 190 verbal-type
items for 4,885 students. Half of the sample was used to select items
for the various programmed tests, and the selected items were then
scored as if the subject had been presented only those items in the
implied order. For comparative purposes, shortened conventional
tests consisting of the K items with the highest point-biserial corre-
lations with the total 190-item test score were constructed. The
various programmed tests and conventional tests were then evalu-
ated by (a) comparing their reproduction of the 190-item total test
score in the cross-validation sample, and (b) comparing their corre-
lations with four criterion tests on which scores had been obtained
for about two-thirds of the initial subjects about one and a half
years after the initial testing.

Five of the seven programmed tests considered in this paper com- 
prised two major sections: (a) A routing section which contains the 
branching necessary to direct the subject to the appropriate items, 
and (b) a measurement section which contains a short test with item 
difficulties concentrated at the appropriate level for the subject. The
other two programmed test procedures involved more complete
branching trees. One of these was a 10-item branching test with
branching occurring after each item according to whether the item
was correct or incorrect. The second branching test scored 25 items
for each subject and branching occurred after each block of five 
items according to the number correct. 

Against the criterion of reproducing the 190-item total test score
in the cross-validation sample the programmed tests were found to
be only slightly superior to the shortened conventional tests. How-
ever, the programmed tests had correlations with the outside cri- 
terion tests that were substantially higher than the corresponding 
shortened conventional tests. It was estimated that a test which was 
parallel to the 190-item total test would have to be 3.36 times as 
long as the best programmed test to have an equal median correla- 
tion with the outside criterion tests. 
ACCURACY OF SELF-REPORTED COLLEGE
GRADE AVERAGES AND CHARACTERISTICS OF
NON AND DISCREPANT REPORTERS


BARBARA A. KIRK AND LYNN SEREDA
University of California, Berkeley


Much of the research in the areas of student characteristics and
prediction of academic success utilizes grades as a means of defining
descriptive samples, or as criteria. Particularly in large-scale re-
search involving questionnaire data, the question arises of the extent
to which a student's self-report may be relied upon, and so utilized
in such research. It is ordinarily very costly to verify the report data
against officially recorded grades.

The present study was not intentionally designed to bear upon
this question. Relevant data had been collected in the process of
other research, and since there was found to be a sparsity of appro-
priate research on the accuracy of grade reporting, it was decided to
investigate this question. It is recognized that these data may have
limited generalizability to other institutions or curricula.

In spite of differences in samples and the academic level for
which grades were reported, e.g., high school (Perry [1940] and
Ash [1947]) and varying levels of college (Dunnette [1952], Black
[1962], and Walsh [1967]), and differences in the format of the
reported data (correlations, percentage agreement, and mean dif-
ference), the data reviewed seemed as a whole to suggest a sub-
stantial relation between actual and self-reported grades. Perry
(1940) reports an r = .826 for students with actual grades above 80
per cent and an r = .664 for students with actual grades below 80
per cent; Dunnette (1952) reports an r = .94 for the overall sample.
However, of the 62 students whose honor point averages were be-
low 1.20, only 32 reported correctly, whereas 27 of 28 students with
averages above 2.00 reported accurately.

Thus it appears that the relation between actual and self-reported
grades is more accurate for upper classmen. Further (Walsh, 1967),
accuracy does not appear to be easily affected by incentives to re-
port inaccurately or by the method used to obtain the report
(interviews, questionnaires, or personal data blanks).

In addition, Walsh (1967) gives percentages of subjects report-
ing grades in exact correspondence with actual grades for three
items, high school GPA, overall average at the State University of
Iowa (S.U.I.), and previous semester at S.U.I., as ranging up to
56 per cent.

There is a suggestion that in situations or circumstances where the
student has the opportunity or need to enhance his self-report,
the student is more likely to report inaccurately or distort. In addi-
tion to Perry's and Dunnette's reports of greater accuracy for
higher grades, Black (1962) found that students fed back inac-
curate class examination sub-score totals were more likely to leave
their marks high than correct them, or to correct them only if they
were too low, and closer examination of Perry's (1940) scatterplot
of actual by self-reported grades showed that when students'
indicating directions of respective sample bias in instances where
analyses are being made on the basis of self-reported data.


Samples and Procedures 


A number of assessment measures along with a 32-item question-
naire were administered by the Counseling Center to a group of Col-
lege of Environmental Design students at the University of Cali-
fornia at Berkeley, as a part of a larger personality description.
Item number 8 of the questionnaire was worded as follows:
." Grades at 


p results to the appropriately concerned academic departments.
Actual cumulative grade point averages at Berkeley were ob-
tained from the files and recorded for all subjects. Of the students
registered in Architecture and Landscape Architecture at the
University of California, Berkeley, for the spring 1966 semester,
44.3 per cent of the male and 54.0 per cent of the female students
had turned in completed questionnaires. Of the 382 males who
returned questionnaires, 328 or 85.9 per cent reported their GPA's
and 38 of 48 or 80.0 per cent of the females reported theirs. The
nonreporter sample consisted then of 54 of the 382 male subjects
(14.1 per cent) and 10 of the 48 female subjects.

The discrepant reporter sample consisted of all subjects who re-
ported a grade point difference from actual grades greater than or
equal to .3 in either direction. This discrepancy size represents a
standard score of ±.63 for males, and ±.88 for females, computed on
the basis of the total reported grade point average sample. Thus
there were 28 or 8.5 per cent males and 3 or 7.9 per cent females who
were discrepant reporters. Female subjects were not included in the
subsequent reported analysis because of the sparsity of their number
and the likelihood of sex difference bias.
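
The sample definition above amounts to a simple rule on the difference between reported and actual grade point averages. A minimal sketch in Python follows, assuming a cutoff of .3 grade points in either direction; the student records and the function names are hypothetical and serve only to illustrate the rule.

```python
DISCREPANCY_CUTOFF = 0.3  # grade points, in either direction


def discrepancy(reported_gpa, actual_gpa):
    """Reported minus actual GPA, rounded to avoid floating-point residue."""
    return round(reported_gpa - actual_gpa, 2)


def is_discrepant(reported_gpa, actual_gpa, cutoff=DISCREPANCY_CUTOFF):
    """True when the report differs from the record by the cutoff or more."""
    return abs(discrepancy(reported_gpa, actual_gpa)) >= cutoff


def direction(reported_gpa, actual_gpa):
    """Label a report as exact, upgraded, or downgraded."""
    diff = discrepancy(reported_gpa, actual_gpa)
    if diff > 0:
        return "upgraded"
    if diff < 0:
        return "downgraded"
    return "exact"


# Hypothetical (reported, actual) pairs, not data from the study.
reports = [(2.7, 2.7), (3.0, 2.7), (2.6, 2.5), (2.2, 2.6), (3.1, 3.0)]

discrepant = [pair for pair in reports if is_discrepant(*pair)]
per_cent_discrepant = 100 * len(discrepant) / len(reports)

for pair in reports:
    print(pair, direction(*pair), is_discrepant(*pair))
print(f"{per_cent_discrepant:.1f} per cent discrepant")
```

Rounding the difference before comparing it to the cutoff matters here, since a binary floating-point difference such as 3.0 - 2.7 falls slightly below .3 and would otherwise be misclassified.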

Non and discrepant reporter samples were then compared to the
total sampled male population of 382 architects and landscape
architects on those variables that seemed of potential relevance to
explaining sample differences. These included five
variables from the questionnaire (transfer background, mother's and
father's occupational and educational levels) and scores ob-
tained on the School and College Ability Test (SCAT), Minnesota
Paper Form Board (MPFB), and Omnibus Personality Inventory
(OPI).


Results 


At the general comparison level the data of Table 1 demonstrate 
substantial positive correlations between actual and reported grade 
point averages for each of the respective samples as well as a slight 
but consistent trend for reported grades to be higher than actual 
grades. 

Of the total sample, 43.2 per cent reported exactly accurate grades,
the percentage of accuracy increasing to 77.8 per cent if discrepan-
cies of ±.1 grade points were considered as reasonably accurate. The
largest reported discrepancy was .7 grade points. There were more
reported grade discrepancies in a positive (upgrading) direction,
43.8 per cent, than in a negative (downgrading) direction, 13.0 per
cent. For discrepancies of .2 grade points or more, only 2.1 per cent of
the subjects downgraded their actual grades by inaccurate reporting,
too small a sample to justify the separate study of downgraders in
this research. In the nonreporter sample the mean actual GPA differs
from the overall sample mean GPA, t = 2.81, p < .01, the nonre-
porters having a lower GPA.

Utilizing the mean self-reported total sample GPA, 2.70, as a cut-
off point for actual grades, 59.8 per cent of the lower classmen re-
ported discrepantly, whereas only 40.2 per cent of the upper class-


TABLE 1
Correlations of Reported Grades with Actual Grades by Samples

                                       Reported Grades    Actual Grades
Group                        Number    Mean     S.D.      Mean     S.D.      r
Freshmen Architecture          44      2.59      .49      2.55      .50     .95
Sophomores Architecture        58      2.06      .45      2.00      .45     .97
Juniors Architecture                                      2.64      .43     .93
Seniors Architecture                                                .44     .96
Landscape Architects           48      2.73      .46      2.05      .46     .93
Graduates                       4      3.75     1.50      3.75     1.50    1.00
Total Sample                                                        .47     .96

Nonreporter sample (subjects
  whose actual GPA was
  available)                   26                         2.40      .41
men reported discrepantly, these results confirming the reports of
Perry (1940) and Dunnette (1952) that lower classmen tend to be 
somewhat less accurate reporters than upper classmen. Closer in- 
spection of the scatterplot of actual by self-reported grades sug- 
gests a slight bimodal tendency for the positively-upgraded reports, 
which further suggests that the likelihood of discrepant self-report- 
ing increases near the boundaries of letter grade cut-off points. That 
is, if a self-reported discrepancy will mean a change of grade cate- 
gory from C to B or from B to A, there appears to be a greater like- 
lihood of an upwardly discrepant self-report. 

Comparing across nondiscrepant and overall samples, no differ- 
ences attributable to the contingency of direct from high school ad- 
mission versus college transfer admission are apparent. 

Father’s and mother’s occupational levels were classified accord- 
ing to the Dictionary of Occupational Titles. While low frequencies 
forced collapsing of cells, chi-square tests for independence resulted
in no significant differences for either father's or mother's occupa-
tional level when discrepant and non-reporter samples were com- 
pared to the total sample. 


TABLE 2
Mean Comparison of Parents' Educational Background on Check
Question #17a and 17b, Scaled 1-Lowest to 9-Highest

                                                Nonreporter     Discrepant
                               Total Sample       Sample          Sample
                                 N      M        N      M        N      M
Father's educational level      308    5.7       48    6.0       27    5.3
Mother's educational level      371    5.2       51    5.5       27    4.5


Father's and mother's educational levels, categorized on a nine-
point item ranging from no formal schooling or some grade
school to professional or graduate degrees, resulted in the means
shown in Table 2. All possible combinations of paired means
were compared by "t" tests when sub-samples were compared to
each other, and by "z" tests when the sub-samples were compared to
the total sample (population), since sub-sample data are included in
the total sample data. An interesting pattern of differences appears
on these variables in that the discrepant reporter sample indicates
parent educational levels lower than the mean, and the nonreporter
sample higher. This difference is especially apparent for mother's
educational level where a significant (p < .01) difference is found.
Although these educational levels are unverified self-reports, there 
appears to be a hint of grade concealment when parents have a high 
level of education and upward distortion when parents' educational 
background is at a less than average level. 

The question of a possible relationship between aptitudes and 
non or discrepant reporting was explored utilizing the SCAT, Form 
UA, and the MPFB. Mean scores and results of the paired compari- 
sons are presented in Table 3. 


TABLE 3
Comparison of Mean SCAT (Form UA) and MPFB Scores

                                         Nonreporter    Discrepancy
                       Total Sample        Sample         Sample
                        N       M         (N = 31)       (N = 14)
SCAT (raw scores)
  Verbal               203     36.4         34.7           36.1
  Quantitative         203     33.9         33.0           30.5
  Total                201     71.9         67.7           66.6
MPFB                   211     52.2         51.2           52.9


The samples are essentially undifferentiated, the one significant 
comparison (p < .01) being between the nonreporter and discrepant 
reporter samples, on the Quantitative section of the SCAT. This
difference, it is important to note, is in the direction of the
discrepant sample being somewhat less able quantitatively than
the nonreporter sample. There are nonsignificant trends of the non-
reporter sample being lower on the verbal section of the SCAT and
of both samples being below the overall sample on the total SCAT
score.

The final comparison was made utilizing the Omnibus Personal-
ity Inventory, form Fx. The respective sub-sample Ns on which 
these comparisons were made were reduced considerably because 
not all of the subjects in the total available sample took this test. 
The interesting finding is that although there are no large overall 
differences when either the non or discrepant reporter samples are 
compared to the total sample, the trends which are present do be- 
come clearer when these two samples are compared with each other. 
That is, the non and discrepant reporter samples appear to be drawn 


Figure 1. Mean profile and comparisons on the OPI Form (fx).
Legend: Total sample, N = 132-135; Nonreporters, N = 20; Discrepant reporters, N = 9.


from opposite poles of the total sample variance on a number of
personality variables.

The discrepant reporters score significantly higher on the thinking
orientation, aesthetic, complexity, and autonomy scales (p < .01),
and lower on the social extroversion (p < .01) and masculinity
scales (p < .05) than the nonreporters. There are slight nonsignifi-
cant trends for the discrepant sample to be lower on the personal in-
tegration and practical orientation scales. The only slightly signifi-
cant sub-sample by total sample comparisons are for the discrepant
sample to be higher on the complexity scale (p < .10), and for the
nonreporter sample to be lower on the altruistic man scale (p < .10).
There are also the nonsignificant trends of both sub-samples being lower
than the overall sample on the anxiety level, altruistic man, and
response bias scales.

Thus it appears that discrepant reporters are more intellectually
and aesthetically oriented with a slightly poorer personal integra-
tion level, while the nonreporters are more oriented to a masculine-
social role.
Discussion 

This study substantiates the findings of previous research in dem- 
onstrating a high positive relationship between actual and reported 
grade point averages. Students with higher grades, and those who are 
in upper division, report more accurately. Those who report inac- 
curately tend to report on themselves in a somewhat more favorable 
light. 

The results also suggest a distinction between the discrepant self- 
reporter and a noncomplying self-reporter. 

The nonreporters have significantly lower grades than those who 
report. They come from a background in which the parents, espe- 
cially the mother, are better educated. However, they are generally 
less intellectually disposed. Further, the fact that nonreporters score 
lower on the AM scale suggests feelings of lesser compunction about 
altruistic behavior and in the context of their lower intellectual dis- 
position this might be extended to indicate less need to comply with 
the self-report instructions. 

Discrepant reporters, on the other hand, are somewhat more in- 
tellectually involved and consequently likely to be more ambitious. 
They appear to be less social than the group as a whole (SE and 
PI), so that their needs in the achievement area are especially 
strong. This suggests a special concern for grades fostered in a con- 
text of high parental standards and expectations, yet coupled with 
the parents' own lack of an intellectual orientation.

In considering the use of self-reported grade data, it is apparent 
that if nonreporters are excluded the population description will 
tend to be biased in the direction opposite to the characteristics that 
this nonreporter sample has been shown to have. On the other hand, 
if discrepant reporters are included they will only bias population
description on those variables which are inaccurately reported.
However, if discrepant reports are included within a dependent vari-
able measure they will tend to bias other related independent com-
parison variables which are accurately measured in the direction of
a regression toward the mean. This effect can be particularly well
illustrated with a dependent variable such as grade point average.
Subjects who upgrade their actual grades will be included in higher
categories of academic success than those in which they actually be-
long on the basis of their self-report and if their (independent) abil- 
ity or performance measures are accurate, these measures will still 
reflect a poorer than reported grade student. Thus, such a discrepant 
reporter will bias the sample of better students in a poorer student 
direction so as to minimize differences on independent variables 
when comparing across categories of the dependent variable, in this 
case GPA. It can be similarly demonstrated that the downgrading 
reporter student will minimize differences on independent variables 
meant to differentiate poorer students because the accurate inde- 
pendent measures are really reflective of a better than reported stu- 
dent. The net result is a regression effect on the independent varia- 
bles meant to differentiate on a dependent variable when discrepant 
reports on the dependent variable are included. Size and composition 
of samples, and purposes of investigations dictate to what extent 
self-report data on grades may be acceptable.
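
The attenuation argument above can be illustrated with a small numerical sketch. The Python below uses entirely hypothetical students (actual GPA, reported GPA, and an accurately measured ability score) and a hypothetical GPA cutoff of 3.0; it compares the mean ability gap between the higher and lower GPA categories when the categories are formed from actual grades and when they are formed from self-reports that include two upgraders.

```python
from statistics import mean

GPA_CUTOFF = 3.0  # hypothetical boundary between GPA categories

# (actual GPA, reported GPA, ability score); all values hypothetical.
students = [
    (3.4, 3.4, 62),
    (3.2, 3.2, 58),
    (3.1, 3.1, 60),
    (2.9, 3.3, 50),  # upgrader: reports above the cutoff
    (2.8, 3.2, 48),  # upgrader: reports above the cutoff
    (2.6, 2.6, 47),
    (2.4, 2.4, 45),
    (2.2, 2.2, 43),
]


def ability_gap(records, gpa_index):
    """Mean ability of the high-GPA group minus that of the low-GPA group.

    gpa_index selects the grouping variable: 0 = actual, 1 = reported GPA.
    """
    high = [r[2] for r in records if r[gpa_index] >= GPA_CUTOFF]
    low = [r[2] for r in records if r[gpa_index] < GPA_CUTOFF]
    return mean(high) - mean(low)


gap_actual = ability_gap(students, 0)
gap_reported = ability_gap(students, 1)

print(f"gap using actual GPA:   {gap_actual:.1f}")
print(f"gap using reported GPA: {gap_reported:.1f}")
```

In this constructed case the upgraders carry their lower ability scores into the higher GPA category, so the difference between categories on the independent variable shrinks when self-reported grades define the categories.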

Further clarification of differences between discrepant and non- 
reporters appears to be of sufficient merit, from both a practical 
and clinical standpoint, to warrant continued research. In the mean- 
time, caution should be exercised in the making of population infer- 
ences when self-report data are utilized. 
In view of the tremendous advances that have been made in
the adaptation of electronic computers and accounting machines
to the processing of statistical data, sections of the Spring and
Autumn issues of EDUCATIONAL AND PSYCHOLOGICAL
MEASUREMENT are devoted to the publication of such programs
as are appropriate to psychometric procedures. Programs relevant
to such problem areas as factor analysis, item analysis, multiple
regression procedures, the estimation of the reliability and validity
of tests, pattern and profile analysis, the analysis of variance and
covariance, discriminant analysis, and test scoring will be con-
sidered. Customarily a program should be expected not to exceed
six or eight printed pages. Manuscripts of four or fewer printed
pages are preferred. Each manuscript will be carefully reviewed as
to its suitability and accuracy of content. In some instances an
accepted paper may be returned to the author for possible revisions
or shortening. The cost to the author will be thirty dollars per
page for regular running text. The extra cost of the composition
of tables and formulas will be added to the basic rate. Manu-
scripts received up to November first will be considered for the
Spring issue; manuscripts received between then and May first
will be considered for the Autumn issue.


Two copies of the manuscript should be sent to: 


Dr. William B. Michael 
325 Callita Place 
San Marino, California 91108. 

FORTAP: A FORTRAN TEST ANALYSIS PACKAGE¹


FRANK B. BAKER
University of Wisconsin
AND
THOMAS J. MARTIN²
University of Wisconsin


THE FORTAP program is the fifth generation in a family of
test analysis programs which had its inception as a program on
the UNIVAC 1103 in 1959 (Baker, 1959). Although the basic
conceptualization of the program did not change materially, each
succeeding generation incorporated additional features and
capabilities. Since the first four generations were written in machine
language for various computers, their availability was restricted.
In order to make the program available to a wider range of
researchers, it was decided to write the fifth generation program in
FORTRAN. The change in programming language made it
necessary to study the structure of the program, and as a result of
this study, the second author restructured the basic program. The
design goals of the latest version were those of flexibility for the
user of the program and modularity of the internal structure of
the program. The design goals were accomplished through the use
of two techniques: First, a system of control cards was devised
which enables one to specify an arbitrary sequence of data
processing and analysis tasks. Second, the computer program was
designed with a high degree of functional modularity. The control
card scheme was implemented through the use of subroutines that
recognize the English words contained in the control cards which
establish a given sequence of operations and prestore subroutines
with the proper parameters. The separation of data processing
and analysis functions was accomplished by constructing a basic
processor routine which produces a test score for each person and
the mean score of all persons selecting a given item response
choice. Because these two vectors of information underlie most
item and test analysis techniques, it was a simple matter to
develop subroutines for various analysis procedures and integrate
them into the program package.

¹ The research reported herein was performed pursuant to a contract with
the U.S. Office of Education, Department of Health, Education and Welfare,
under the provisions of the Cooperative Research Program. Center no. C-03/
Contract OE 5-10-154.
² Now with IBM, San Jose, California.


Analysis Section 


The analysis section of the program consists of three major 
routines, GITAP, RAVE, and BIOITEM. The GITAP item analy- 
sis subroutine is similar to that reported earlier by Baker (1963) 
and produces the following information for every item response 
in the instrument having a scoring key: The number of subjects 
responding, the item difficulty, the item-criterion correlation, and 
approximations to the parameters of the fitted normal ogive. The 
RAVE subroutine performs the reciprocal-averages scaling technique 
which yields a set of item-response keys which maximize the in- 
strument’s internal consistency. The iterative procedure begins with 
a set of a priori item-response weights and uses these to score 
the papers and to compute the Hoyt reliability coefficient. The 
mean item response scores are then used to devise a new set of 
item-response weights, the subjects' responses are rescored, and the
Hoyt index is computed. The iterative procedure is repeated until
the difference between two successive Hoyt indices is less than
.05. The output of the subroutine is the optimal set of item-response
weights, the Hoyt index, and a test score for each subject based
upon the derived weights. The BIOITEM subroutine consists of 
two sections. In the first section the basic item-response data are 
restructured and grouped by criterion score so that the proportion 
of subjects choosing a given item response at each criterion score 
is obtained. The second section employs either maximum likelihood
or minimum χ² estimation procedures to estimate the parameters
of either the normal or logistic ogive fitted to the item response
data. The output of this subroutine consists of item parameter
estimates and χ² goodness-of-fit values for each item response for
those items possessing an item key.
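The reciprocal-averages iteration described for RAVE can be sketched as follows. This is only an illustration in Python, not the FORTRAN of the FORTAP package: the function names, the data layout (a list of response-choice indices per subject), and the default tolerance are assumptions made for the example. The Hoyt coefficient is computed from the persons-by-items analysis of variance as 1 - MS(residual)/MS(persons).

```python
def hoyt_reliability(item_scores):
    """Hoyt ANOVA reliability: 1 - MS_residual / MS_persons."""
    n, k = len(item_scores), len(item_scores[0])
    grand = sum(map(sum, item_scores)) / (n * k)
    ss_total = sum((x - grand) ** 2 for row in item_scores for x in row)
    ss_persons = k * sum((sum(row) / k - grand) ** 2 for row in item_scores)
    ss_items = n * sum(
        (sum(row[j] for row in item_scores) / n - grand) ** 2 for j in range(k)
    )
    ms_persons = ss_persons / (n - 1)
    ms_residual = (ss_total - ss_persons - ss_items) / ((n - 1) * (k - 1))
    return 1.0 - ms_residual / ms_persons


def reciprocal_averages(responses, weights, tol=0.05, max_iter=50):
    """Hypothetical helper: iterate weights until the Hoyt index stabilizes.

    responses[p][i] is the index of the choice subject p made on item i;
    weights[i][c] is the weight of choice c on item i (a priori at the start).
    Returns the weights last used for scoring, the Hoyt index, and the scores.
    """
    previous = None
    for iteration in range(max_iter):
        # Score every paper with the current weights.
        item_scores = [[weights[i][c] for i, c in enumerate(row)] for row in responses]
        totals = [sum(row) for row in item_scores]
        r = hoyt_reliability(item_scores)
        converged = previous is not None and abs(r - previous) < tol
        if converged or iteration == max_iter - 1:
            return weights, r, totals
        previous = r
        # New weight of a choice = mean total score of the subjects choosing it;
        # a choice nobody selected keeps its old weight.
        new_weights = []
        for i, item_weights in enumerate(weights):
            row = []
            for c, old in enumerate(item_weights):
                chosen = [totals[p] for p, resp in enumerate(responses) if resp[i] == c]
                row.append(sum(chosen) / len(chosen) if chosen else old)
            new_weights.append(row)
        weights = new_weights


responses = [[0, 0, 0], [1, 1, 0], [2, 1, 1], [2, 2, 2]]
a_priori = [[0, 1, 2], [0, 1, 2], [0, 1, 2]]
derived, hoyt, scores = reciprocal_averages(responses, a_priori)
```

The stopping tolerance is left as a parameter so that a stricter or looser criterion can be tried without touching the loop.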
Control Card System 


The control card system consists of a series of cards, each of 
which tells the computer to perform a specific task and also con- 
tains parameters requesting certain options. Some of these control 
cards are associated with data processing and some with analyses. 
The following is a listing of the card types, but the details of the
parameters they contain have been omitted (see Baker and
Martin, 1968):

1. *NEWJOB initiates a new problem and specifies the number
of items and subjects.

2. *KEYS is the first card in the group of cards containing
the item-response weights.

3. *CRITERION specifies whether an internal or external
criterion is to be used in the analysis routines.

4. *GITAP specifies that the item analysis routine is to be
performed.

5. *RAVE specifies that the reciprocal-averages routine is to be
performed.

6. *BIOITEM specifies that the curve-fitting item analysis
procedures are to be performed.

7. *DATA is the first card in the deck of cards containing the
subjects' item response choices.

8. *RUN. All of the previous control cards establish a specific
configuration of data processing and analysis. The *RUN
card causes the sequence to be executed.


The system of control cards can be used to establish many
different sequences to be performed; however, to initialize the
system properly, the initial configuration of control cards must be as
follows:

1. The first control card must be *NEWJOB.

2. The second control card must be *KEYS, which is followed
by a set of cards containing the item keys.

3. The third control card must be *CRITERION, which may be
followed by cards containing the external criterion scores.

4. The fourth control card must be an analysis specification
(*GITAP, *RAVE, *BIOITEM).

5. The fifth control card must be *DATA, which is followed
by a set of data cards; alternatively, an auxiliary magnetic tape
containing the data must be available.

6. The last control card in the initial configuration is the *RUN
card which initiates the execution of the tasks specified.
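The required ordering of the initial configuration lends itself to a simple check. The sketch below is written in Python for illustration only; the function name and the representation of a deck as a list of control words (with key, criterion, and data cards already stripped out) are assumptions, not part of FORTAP.

```python
ANALYSES = {"*GITAP", "*RAVE", "*BIOITEM"}

# Required control word at each position of the initial configuration.
INITIAL_CONFIGURATION = [
    {"*NEWJOB"},
    {"*KEYS"},
    {"*CRITERION"},
    ANALYSES,
    {"*DATA"},
    {"*RUN"},
]


def check_initial_configuration(control_cards):
    """Return a list of error messages; an empty list means the order is valid."""
    errors = []
    if len(control_cards) < len(INITIAL_CONFIGURATION):
        errors.append(
            f"expected at least {len(INITIAL_CONFIGURATION)} control cards, "
            f"found {len(control_cards)}"
        )
    for position, (card, allowed) in enumerate(
        zip(control_cards, INITIAL_CONFIGURATION), start=1
    ):
        if card not in allowed:
            wanted = " or ".join(sorted(allowed))
            errors.append(f"card {position}: expected {wanted}, found {card}")
    return errors


good_deck = ["*NEWJOB", "*KEYS", "*CRITERION", "*RAVE", "*DATA", "*RUN"]
bad_deck = ["*NEWJOB", "*CRITERION", "*KEYS", "*GITAP", "*DATA", "*RUN"]
```

Only the first six control cards are checked, since any combination of control cards may follow the initial configuration.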


Output 


The initial configuration will also result in the output of certain 
information which will not necessarily appear in subsequent
analyses. These are: A test score for each subject, the summary statistics
of the score distribution, and the ANOVA table for the Hoyt 
internal consistency reliability index. In the case of *GITAP and 
*BIOITEM these outputs are based upon the initial set of item- 
response weights. In the case of *RAVE they are based upon the 
derived item-response weights. Once the initial configuration has 
been specified, any other combination of control cards may follow. 
For example, if one wished to score the data using three different 
subscales, three sets, each containing *KEYS, a set of item-response
keys, and a *RUN card would be used in conjunction with the 
previously specified analysis. A commonly used sequence is one 
which specifies *RAVE as the analysis in the initial configuration 
and follows it by *GITAP, *RUN to obtain the item-criterion 
correlations based upon the derived weights. Thus, the control 
card system provides the user with a wide range of possibilities 
in terms of order of processing and analysis. 


Data Preparation 


The design goal of flexibility has also been incorporated into 
the preparation of data for the program. The key card for an 
item contains an item number, the number of possible responses 
to the item, and the weight assigned to each response. The scoring
procedure is such that the program only processes those items which 
possess a key. Thus, subscale scoring can be accomplished by 
including keys for only those items in the subscale. If all the keys 
for an instrument are the same, the first item key is followed by a 
card containing DITTO. The method of providing the computer 
with scoring keys gives the user considerable flexibility in
obtaining total test scores, and scoring as many subscale scores as he
may desire. Considerable flexibility is also evident in the ways in
which item-response data can be presented to the computer. The
[...] means for transcribing the item-response choices made by
a subject to a computer in acceptable form, namely punched cards,
is to use a DIGITEK optical scanner. The FORTAP program
has the capability for accepting cards punched by this equipment.
In addition, the user can manually key punch the responses in
a standard punched card format or in an arbitrary card
format. In all three cases the data decks are preceded by a
format card telling the computer what card format is [...].

Capacity of Program 


The capacity of the FORTAP program is subject to the following
[...]tions: The total number of possible item choices is less
than 1,800. For example, a test containing 300 five-choice items has
1,500 item choices; no item has more than seven possible
response choices; the maximum number of subjects in the sample
is 1/67. If the item response data exceed the array allocation
in main memory, the data are partitioned, and the excess
automatically stored on auxiliary magnetic tape. In actual practice,
the capacity of the program exceeds that required by most
researchers. If all the data fit in main memory, the running times
on the Control Data 1604 are extremely short, on the order of
[...] for the *GITAP, and on the order of seconds to a minute
for the *RAVE or *BIOITEM analyses. The FORTAP
program contains an extensive error alarm system, and any fatal
error causes it to search for the next *NEWJOB. Thus, if one
analysis fails, the next problem in a series will be performed, a
[...]ice which is similar to the usual batch processing recovery
procedures.
From the point of view of the person wishing to analyze test
results, the FORTAP program provides a very powerful and
flexible analysis tool, yet the data preparation procedures and control
card system are very simple to use. A parallel form of simplicity
[...] within the computer program, which was developed to allow
[...] add additional features, analyses, and control cards to the
program. Because the programming language is FORTRAN, the
FORTAP program can be adapted to operate on a wide range of
computers. For use on medium size computers, a considerable
reduction in size of the program can be accomplished by removal of
the *BIOITEM section. Copies of the FORTAP program can be
obtained from the authors by submitting a 7-track IBM compatible 
magnetic tape upon which the program will be written. 

COMPUTER-AIDED ITEM SAMPLING FOR
ACHIEVEMENT TESTING: A DESCRIPTION OF
A COMPUTER PROGRAM IMPLEMENTING
THE UNIVERSE DEFINED TEST CONCEPT


DAVID M. SHOEMAKER
Oklahoma State University
AND
H. G. OSBURN
University of Houston


A universe defined test is a test constructed and administered
in such a way that an examinee’s score on the test provides an
unbiased estimate of his score on some explicitly defined universe
of item content (Osburn, 1968). Two general requirements for
test construction are implied in the above definition. The first is
that all items which could possibly appear in the test should be
specified in advance; this implies that for a specified content area
it is possible to clearly define a content population of items.
Secondly, the items in a particular test should be selected by
random sampling or stratified random sampling from the universe
(or population) of items defining the content area.

The idea of random sampling from a universe of test items is
not new. Psychometric theorists have used this idea extensively
in building reliability and other testing models. The point is that
few test constructors have seriously attempted to implement the
model in actual practice. Osburn (1968), however, has defined a
procedure for implementing in practice the universe defined test
concept. The focus of the present paper is not an elaboration of
the implications of the universe defined test for test construction
but is, instead, a description of a computer program capable of
generating random or stratified random parallel tests from a
specified content population.


Item Forms Analysis 


One possible way of specifying a finite universe of test items 
is that of cataloguing all items that would be allowed to appear 
on the test. However, a potentially more useful approach to de- 
fining a universe of content is that of analyzing the content area 
into a hierarchical arrangement of basic item forms. The item 
form is a fundamental concept in both the universe defined test 
and in the computer program for item generation. 

An item form is a principle or procedure for generating a sub- 
class of items having a definite syntactical structure. Item forms 
are composed of constant and variable elements and, as such, 
define classes of item sentences by specifying the replacement sets 
for the variable elements. An item form may be very general 
and abstract or quite specific and particular. The analysis of a 
content area into item forms proceeds from the general to the 
specific in much the same way as an ordinary subject outline or a 
behavioral objective outline with one critical difference. In item 
forms analysis there is an unbroken link between the abstract 
system and the individual item sentence. This property makes it
possible to unambiguously define a universe of content as an 
hierarchical arrangement of item forms together with the replace- 
ment sets for the variable elements. 


Description of Computer Program 


In the opinion of the authors, the desirable features of a com- 
puter program for generating universe defined tests using the 
concept of the item form are as follows: 


1. The program must be capable of generating k tests each
composed of n items, where k and n are specified by the user. 

2. The program must be capable of generating randomly par- 
allel or stratified random parallel tests. 

3. The procedure for coding and constructing item forms and 
random expressions must be relatively simple and flexible. 

4. Given item forms and replacement sets, the program must
be capable of assembling from diverse sources (alphanumeric
expressions stored in core and on tape) the appropriate content for a
specific item.

5. A flexible system for supplying random numbers must be
available; the system must be capable of supplying a single random
number, a series of random numbers, a dependent random number
(the nth number is a function of previously generated numbers in
a specific item form), and both frequency and probability
distributions, including joint distributions. Furthermore, the user
must have control over the range and precision of all random
numbers generated.

6. The printout of each item must resemble a format typically 
encountered in paper-and-pencil tests—this implies a flexible for- 
mat control system for item output. 

7. An updating procedure must be available for modifying or
supplementing previously defined item forms and replacement sets.

8. The program must be capable of computing the answer to
each item generated.


The computer program consists of one monitor subprogram
(MAIN), two primary subprograms (MAKTXT and TESTS), two
secondary subprograms (RNUMBR and PRINTO), and six
auxiliary subprograms (CTOF, FTOC, COMPZ, DCOMPZ,
RANDOM, and TCHECK). Input to the program is in the form of
independent data blocks, three of which are the block containing
the item forms, the block containing the replacement sets, and the
block specifying which item forms are contained within each
stratum. The function of the MAKTXT subprogram is that of (a)
reading in the item forms, replacement sets, and strata codes,
(b) storing the alphanumeric data in core or, if the data storage
exceeds core size, on auxiliary tapes, and (c) constructing an
accounting system specifying the location in core or on tape of the
relevant context for each item form and each replacement set.

Given a specific stratum number, the second primary subprogram
(TESTS) randomly selects an item form within that stratum,
substitutes randomly sampled elements from replacement sets into
that item form, and prints out the particular item resulting from
that item form. TESTS calls upon two secondary subprograms,
RNUMBR and PRINTO; the former subprogram supplies all
random numbers and distributions of numbers and the latter
subprogram outputs the item according to the format specified by the
item form constructor. Both secondary subprograms are unique
and will be described in more detail at a later point. Of the six
auxiliary subprograms, four (CTOF, FTOC, COMPZ, and
DCOMPZ) are concerned with manipulation of alphanumeric
information: CTOF converts a number expressed in alphanumeric
characters into its corresponding floating-point or integer
equivalent, FTOC converts a floating-point or integer number into its
corresponding alphanumeric equivalent, COMPZ composes six or
fewer individual alphanumeric characters into one alphanumeric
word, and DCOMPZ decomposes one alphanumeric word into six
individual alphanumeric characters. The remaining two auxiliary
subprograms, RANDOM and TCHECK, are, respectively, designed
to supply random numbers from a rectangular distribution of
numbers with a 0.00-1.00 range, and to check that the contents of
the appropriate tapes are in core during the execution of each item
form.


The RNUMBR Subprogram 


The random number plays an important role in the item form
in that it is only through the substitution of elements from the
appropriate replacement sets and/or random numbers into an item
form that a specific item is generated. In practice, the substitution
of random numbers into an item form may require the generation
of one specific random number; however, it is more frequently the
case that a series of random numbers, a distribution of numbers,
or a random number which is a function of previously generated
random numbers is required. Through calling
the RNUMBR subroutine with the appropriate arguments the
computer program is capable of generating one random number, a
series of random numbers, a single probability distribution, a
single frequency distribution, a joint probability distribution, or a
joint frequency distribution. The arguments to the subroutine also
specify the acceptable range and precision of the random number
or numbers to be generated, an important feature because the
difficulty level of a specific item may be a function of the random
number or numbers generated.

As an example of one random number being dependent upon
previously generated random numbers, consider an item form
specifying three random number replacements. The first random
number specifies the mean of a distribution; the second, the standard
deviation; and the third, a number sampled at random from within 
two standard deviations from the mean. Furthermore, to facilitate 
the computations involved, assume that the item form constructor 
would like the mean of the distribution to be a random integer 
number between 10 and 20 and the standard deviation to be 
another random integer between 5 and 8. The following arguments, 
each beginning with the character '$', inserted in the item form 
would produce the desired results. 


Input: GIVEN A NORMAL DISTRIBUTION WITH MEAN 
EQUAL TO $01 0 20 1 1 AND STANDARD DEVIA- 
TION /1H0/ EQUAL TO $01 508 0. IF ONE NUMBER 
IS RANDOMLY SAMPLED FROM THIS DISTRIBU- 
TION, /1H0/ WHAT IS THE PROBABILITY THAT 
THIS NUMBER WILL BE GREATER THAN OR 
EQUAL /1H0/ TO $01 0 +1-2—2 0 +1+2+2 0 FINIS 


Output: GIVEN A NORMAL DISTRIBUTION WITH MEAN 
EQUAL TO 17 AND STANDARD DEVIATION EQUAL 
TO 5. IF ONE NUMBER IS RANDOMLY SAMPLED 
FROM THE DISTRIBUTION WHAT IS THE PROB- 
ABILITY THAT THIS NUMBER WILL BE GREATER 
THAN OR EQUAL TO 24 
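The dependency among the three random numbers in this example, and the answer to the generated item, can be sketched as follows. The Python below is an illustration under stated assumptions, not a rendering of RNUMBR: the function names and the `decimals` argument standing in for "precision" are hypothetical, and the upper-tail probability is computed from the complementary error function of the standard library.

```python
import math
import random


def random_number(rng, low, high, decimals=0):
    """Draw one number in [low, high] at the requested precision (assumed API)."""
    if decimals == 0:
        return rng.randint(low, high)
    return round(rng.uniform(low, high), decimals)


def generate_normal_item(rng):
    """Mean in 10..20, SD in 5..8, cutoff within two SDs of the mean."""
    mean = random_number(rng, 10, 20)
    sd = random_number(rng, 5, 8)
    # The third number depends on the two generated before it.
    cutoff = random_number(rng, mean - 2 * sd, mean + 2 * sd)
    return mean, sd, cutoff


def upper_tail_probability(mean, sd, cutoff):
    """P(X >= cutoff) for X normally distributed with the given mean and SD."""
    z = (cutoff - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))


rng = random.Random(1969)
mean, sd, cutoff = generate_normal_item(rng)
answer_to_sample_item = upper_tail_probability(17, 5, 24)  # the item shown above
```

For the item printed above (mean 17, standard deviation 5, cutoff 24) the cutoff lies 1.4 standard deviations above the mean, so the probability is roughly .08.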


The PRINTO Subprogram 


The output format for each item form is determined by the
format control words inserted in the item form by the item form
constructor. Format control words are standard Fortran format
codes delimited by /slashes/. The PRINTO subroutine parses out
the format control words within each item form and reassembles
them into a standard Fortran format vector. By inserting format
control words within each item form, the item form constructor
can display the resulting item in any format desired. Given the
item form described above, the following format vector would
have been constructed by the PRINTO subroutine: (1H0,
72A1/1H0, 70A1/1H0, 70A1/1H0, 6A1).
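The parsing step can be illustrated with a short Python sketch. It assumes, hypothetically, that the item text has already had its random numbers and replacement expressions substituted, that each text segment is emitted as nA1 with n the number of characters in the segment, that a control word beginning with 1H starts a new output record, and that every format vector opens with a 1H0 carriage control. How PRINTO actually pads or counts characters is not stated in the text, so the counts here are a simplification.

```python
import re

# A format control word is any run of non-blank, non-slash characters
# enclosed in slashes, e.g. /1H0/ or /05X/.
CONTROL_WORD = re.compile(r"\s*/([^/\s]+)/\s*")


def build_format_vector(item_text, leading="1H0"):
    """Reassemble the /slash/-delimited control words into one format string."""
    pieces = CONTROL_WORD.split(item_text.strip())
    # pieces alternates: text, control word, text, control word, text, ...
    records = [[leading]]
    if pieces[0]:
        records[0].append(f"{len(pieces[0])}A1")
    for word, text in zip(pieces[1::2], pieces[2::2]):
        if word.startswith("1H"):
            records.append([word])      # carriage control: new output record
        else:
            records[-1].append(word)    # e.g. 05X stays within the record
        if text:
            records[-1].append(f"{len(text)}A1")
    return "(" + "/".join(", ".join(fields) for fields in records) + ")"


first_line = (
    "GIVEN A NORMAL DISTRIBUTION WITH MEAN EQUAL TO 17 AND STANDARD DEVIATION"
)
vector = build_format_vector(first_line + " /1H0/ TO 24")
```

The first line of the sample item is 72 characters long once the mean has been substituted, which agrees with the 72A1 field in the format vector quoted above.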
Sample Coded Item Forms With Items Generated 


Many item forms require the substitution of an expression
randomly sampled from a replacement set. In the sample item forms
given below, a pair of integers, the first of which is preceded by a
double minus sign, indicates that an element from a replacement
set is to be inserted into the item form at that point. The first
integer specifies the code number of the replacement set and the
second number indicates the dependency link between the specified
replacement set and another replacement set. For example, the
instruction --24 0 would result in an expression being randomly
sampled from replacement set 24 and inserted into the item form
at that point; --24 18 would result in the kth expression being
selected from replacement set 24, where k is the number of the
expression most recently selected from replacement set 18. In the
computer program an element in a replacement set may require
within itself the substitution of an element from another
replacement set; however, the degree of nesting is limited to one.
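The replacement rule with its dependency link can be sketched in a few lines of Python. The function name, the dictionary holding the replacement sets, and the sample set contents are all hypothetical; the sketch also ignores the one level of nesting mentioned above and does not treat the FINIS terminator specially.

```python
import random
import re

# "--<set> <link>": a double minus sign, the replacement-set code, then the link.
INSTRUCTION = re.compile(r"--(\d+) (\d+)")


def expand_item_form(item_form, replacement_sets, rng):
    """Substitute replacement-set elements into an item form."""
    last_index = {}  # set code -> position of the expression most recently chosen

    def substitute(match):
        set_code, link = int(match.group(1)), int(match.group(2))
        expressions = replacement_sets[set_code]
        if link == 0:
            index = rng.randrange(len(expressions))  # independent random draw
        else:
            index = last_index[link]                 # kth expression, k from the linked set
        last_index[set_code] = index
        return expressions[index]

    return INSTRUCTION.sub(substitute, item_form)


# Hypothetical replacement sets; the paper does not list its own.
sets = {
    18: ["MEAN", "MEDIAN"],
    24: ["SUM OF THE SCORES DIVIDED BY N", "MIDDLE SCORE"],
}
item = expand_item_form("THE --18 0 IS THE --24 18", sets, random.Random(7))
```

Because the second instruction is linked to set 18, the definition always matches the statistic chosen first, whichever one the random draw selects.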

The following are several examples of coded item forms and 
specific items generated in the content area of statistics. The 
sample item forms have been chosen so as to illustrate the flexibil- 
ity and major capabilities of the computer program. The replace- 
ment sets are not given. 


Input:
GIVEN THE FOLLOWING SETS /1H0//05X/ A = --10
/1H0//05X/ B = --10 /1H0//05X/ C = --10 /1H0//05X/
D = --10 /1H0/. COMPUTE THE PROBABILITY THAT
AN ELEMENT IS IN --42 0 FINIS
Output:
GIVEN THE FOLLOWING SETS
A = (SUSAN PAMELA ALICE BETH KATE)
B = (SALLY KATE MARY SUSAN ANN)
C = (ALICE JANE PAMELA MARY SALLY)
D = (ANN BETH KATE SUSAN)
COMPUTE THE PROBABILITY THAT AN ELEMENT IS IN
BOTH B AND C.


Input:
GIVEN THE FOLLOWING JOINT FREQUENCY DISTRIBUTION.
/1H0//R07/ $004 125125 /F07//1H0/ WHERE X IS THE
VARIABLE ALONG THE ABSCISSA AND Y IS THE VARIABLE
ALONG THE ORDINATE. IF X IS --53 0 AND ONE
VALUE OF X IS SAMPLED AT RANDOM COMPUTE THE
PROBABILITY THAT Y IS --54 0 FINIS
Output:
GIVEN THE FOLLOWING JOINT FREQUENCY DISTRIBUTION.

09-10      0     2     4     2     0
07-08      2    13    21    13     2
05-06      4    21    35    21     4
03-04      2    13    21    13     2
01-02      0     2     4     2     0
       01-02 03-04 05-06 07-08 09-10

WHERE X IS THE VARIABLE ALONG THE ABSCISSA AND
Y IS THE VARIABLE ALONG THE ORDINATE. IF X IS
BETWEEN 1 AND 4 AND ONE VALUE OF X IS SAMPLED
AT RANDOM, COMPUTE THE PROBABILITY THAT Y IS
LESS THAN 9.


Input: 

GIVEN THE FOLLOWING PROBABILITY DISTRIBUTION. 
/1H0//R05/ $0012 000125 /F05//1H0/ SUPPOSE THAT $01 
3050 NUMBERS ARE RANDOMLY SELECTED (WITH RE- 
PLACEMENT) FROM THIS DISTRIBUTION. COMPUTE 
THE PROBABILITY THAT ——73 0 FINIS 


Output:
GIVEN THE FOLLOWING PROBABILITY DISTRIBUTION.
09-10 .05
07-08 .25
05-06 .40
03-04 .25
01-02 .05
SUPPOSE THAT 3 NUMBERS ARE RANDOMLY SELECTED
(WITH REPLACEMENT) FROM THIS DISTRIBUTION.
COMPUTE THE PROBABILITY THAT AT LEAST 2 OF THE
NUMBERS ARE BETWEEN 3 AND 6
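The answer to this item follows from the binomial distribution, since the three selections are made with replacement. The short Python sketch below works it with exact fractions; the function name is hypothetical and the calculation is offered only as an illustration of how such an answer could be obtained.

```python
from fractions import Fraction
from math import comb

# Printed distribution: class interval -> probability.
DISTRIBUTION = {
    (9, 10): Fraction(5, 100),
    (7, 8): Fraction(25, 100),
    (5, 6): Fraction(40, 100),
    (3, 4): Fraction(25, 100),
    (1, 2): Fraction(5, 100),
}


def at_least(successes, trials, p):
    """P(at least `successes` hits in `trials` independent draws)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# A single number falls between 3 and 6 with probability .25 + .40 = .65.
p_between_3_and_6 = DISTRIBUTION[(3, 4)] + DISTRIBUTION[(5, 6)]
answer = at_least(2, 3, p_between_3_and_6)
```

The result is 2873/4000, or about .718.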


Input:
--82 0 --64 32 --88 32 FINIS
Output:
IT IS HYPOTHESIZED THAT A CERTAIN RAT HAS A .15
PROBABILITY OF TAKING THE LEFT HAND ALLEY, A
.12 PROBABILITY OF TAKING THE CENTER ALLEY AND
A .78 PROBABILITY OF TAKING THE RIGHT HAND ALLEY
IN A MAZE. SUPPOSE THAT THE RAT IS GIVEN ONE
TRIAL COMPUTE THE PROBABILITY THAT THE RAT
TAKES EITHER THE LEFT HAND OR THE CENTER ALLEY.


Discussion

Several investigators (e.g., Lord, 1962; Osburn, 1967;
Shoemaker and Osburn, 1969) have demonstrated the desirability of
conducting unmatched data studies in which a different set of 
n test items is generated for each individual. The computer pro- 
gram is a step forward in the practical implementation of this 
concept; however, the task is as yet incomplete. The complete 
implementation of the unmatched data investigation requires that 
a procedure be developed for computing the answer for each item 
generated. The trend of the current research is towards the develop- 
ment of an algorithm for computing answers to items. 

In the area of educational measurements, the computer program 
for generating universe defined tests has several immediate applica- 
tions: (a) the problem of constructing a classroom test repre- 
sentative of the course content is resolved, (b) prior to taking an 
examination, students can be given sample items representative 
of items which will appear on the test, (c) “make-up” examinations 
can be constructed quite easily, and (d) one individual may be 
repeatedly tested using random or stratified random parallel tests. 

A PROGRAM TO COMPOSE AND PRINT TESTS
FOR INSTRUCTIONAL TESTING USING
ITEM SAMPLING


W. GORTH AND A. GRAYSON
Stanford University 


In instructional testing, item sampling designs are becoming
important. Cronbach (1964) has indicated their usefulness in
improving courses. One difficulty in implementing these designs is
the need for a large number of different test forms. A FORTRAN
program is available to compose and print any number of tests
consisting of questions, multiple-choice or completion type, selected
from an item pool.


Program 


The program reads a test title card, the question cards, and 
the specification of items as well as their order on the test 
forms. The question cards are checked for cards out of sequence 
or missing, and incorrect questions are rejected. The program prints 
the number of copies of test forms requested, their answer key, 
and the error messages generated during the checking. Two eight- 
by-eleven copies of a page of the test are printed on each sheet of 
output. Questions are printed entirely on one sheet of output, not 
divided between two. Each page is labeled with the title, the 
form number, and the page number. The alternatives for multiple- 
choice questions are randomized. The answer key includes the 
title, the form number, the question number, the letter of the 
correct alternative, the subpool to which the question may be 
assigned, e.g., content category, and the message of symbols or
diagrams which are to be added to the question. Cards may be 
requested which contain the answer key information in a format 
for a program (Gorth, Grayson, Popejoy, and Stroud, 1968) which 
stores answer keys, student identification information, and re- 
sponses. 
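The randomization of alternatives and the bookkeeping needed for the answer key can be sketched briefly. The Python below is an illustration only; the function name, the argument layout, and the lettering of alternatives are assumptions and are not taken from the FORTRAN program described.

```python
import random

LETTERS = "ABCDEFGH"


def compose_question(stem, alternatives, correct_index, rng):
    """Shuffle the alternatives and return the printed lines and the key letter."""
    order = list(range(len(alternatives)))
    rng.shuffle(order)  # order[position] = original index printed at that position
    lines = [stem]
    for position, source in enumerate(order):
        lines.append(f"{LETTERS[position]}. {alternatives[source]}")
    key_letter = LETTERS[order.index(correct_index)]
    return lines, key_letter


lines, key = compose_question(
    "WHAT IS 2 + 2?", ["3", "4", "5", "22"], correct_index=1, rng=random.Random(3)
)
```

Recording the permutation rather than the shuffled text keeps the answer key correct however the alternatives happen to fall on a given form.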

Additional information about the program and its availability 
may be obtained from William Gorth, SCRDT, School of Edu- 
cation, Stanford, California 94305. 

A TAPE-BASED DATA BANK FROM
EDUCATIONAL RESEARCH OR INSTRUCTIONAL
TESTING USING LONGITUDINAL ITEM SAMPLING¹


W. GORTH, A. GRAYSON, L. POPEJOY, AND T. STROUD
Stanford University


EITHER educational research or instructional testing using
longitudinal item sampling at a school site generates a large amount
of diverse data. For example, Project CAM (Allen, DeLay, Gorth,
and Popejoy, 1968) includes approximately 10 courses in the
public high schools. Each of the 10 courses has an average of
200 students. For each student there are 15 entries of background
information, including age, sex, aptitude and attitude scores, and
demographic information. The high school courses have been
divided into as many as 30 different instructional units, which could
be considered experimental treatments. The 30 units might be
assigned in any order and might be completed on any day. In
order to monitor different aspects of the achievement of the
students during the school year, 300 questions, each with five
alternative responses, have been written for each course. Each
question is categorized into strata of each of six different
dimensions. Each dimension can be used to select questions for
subsequent analysis. The questions are administered in the form of
10 tests, each containing 30 questions. These tests can be
administered in any order to the students on any day of the school
year. The student on the average sees at least 10 tests during
the school year. The students’ actual responses are retained. The
above research design reduces to the usual instructional testing
pattern if all students proceed through the course at the same
rate and are tested at the same time with the same test.

¹ The research herein was performed pursuant to a grant from the Charles F.
Kettering Foundation to Dwight W. Allen, Dean, School of Education,
University of Massachusetts.


Program 




Since the number of variables, i.e., students, treatments,
questions, tests, and dates of administration, would be large, a very
general system of data storage had to be developed which would
accommodate a variety of analyses. The system devised uses a
computer program written in FORTRAN for the IBM 360/67
computer. It is able to collate and update a tape-based data bank
at any time, so that as data about tests, items, student responses,
or instructional treatments are obtained, they may be immediately
incorporated into the data bank.


Input 


Student data. Name, birthdate, sex, student ID number and up 
to 15 two-digit covariate scores, such as SCAT-STEP tests, IOWA 
tests, PORTLAND PROGNOSTIC and other achievement tests, 
IQ tests, attitude tests, school counselor, elementary school at- 
tended and assorted demographic data. 

Test and question data. Test form number, question numbers 
(up to 50 per monitor), question answers and space for 15 four- 
digit classifiers which specify the material measured by each ques- 
tion. 

Student response data. The actual response for each question 
taken by a student is recorded, together with the testing period and 
a representative testing period date, common to all students
monitored during the testing period. The program automatically
scores each response and records the result.

Test objectives data. The text of test objectives for each course. 

Instructional treatment data. A three-digit number which can 
identify a unit, contract or learning package that has been com- 
pleted, the level at which it was completed, called the mode, the 
date it was completed, and a two-digit score on any test taken 
at the completion of that contract. 
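
The input categories above can be pictured as simple records. The following is a minimal sketch in Python, not the FORTRAN program itself; the class names, field names, and the `score` helper are hypothetical and are introduced only to illustrate how a stored answer key allows each response to be scored automatically.

```python
from dataclasses import dataclass, field


@dataclass
class Question:
    number: int
    answer: int                                  # keyed alternative, 1 to 5
    classifiers: list = field(default_factory=list)  # up to 15 four-digit codes


@dataclass
class Response:
    student_id: int
    question: int
    period: int            # testing period in which the question was seen
    choice: int            # alternative the student marked
    correct: bool = False  # filled in by score()


def score(response, key):
    """Mark a response right or wrong against the stored answer key."""
    response.correct = response.choice == key[response.question].answer
    return response


# Hypothetical example: question 17 is keyed to alternative 3.
key = {17: Question(17, answer=3, classifiers=[1204, 2001])}
print(score(Response(1001, 17, period=2, choice=3), key).correct)
```

Keeping the raw choice alongside the scored result mirrors the statement that the students' actual responses are retained.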


Related Programs 


The data bank may be listed at any time. Other programs 
which may be used with the tape (a) compose and print tests 
for instructional testing using item sampling (Gorth and Grayson, 
1968), (b) tabulate and plot achievement profiles of longitudinal 
achievement testing using item sampling (Gorth, Grayson, and 
Stroud, 1968), (c) evaluate item performance by internal and 
external criteria in a longitudinal testing program using item 
sampling (Gorth, Grayson, and Lindeman, 1968), and (d) produce 
reports for students about their achievement at each testing period 
(Gorth, Grayson, and Pinsky, 1968). 

Additional information about the program and its availability 
may be obtained from William Gorth, SCRDT, School of Edu- 
cation, Stanford, California 94305.

A COMPUTER PROGRAM TO TABULATE AND PLOT
ACHIEVEMENT PROFILES OF LONGITUDINAL
ACHIEVEMENT TESTING USING ITEM SAMPLING*


W. GORTH, A. GRAYSON, AND T. STROUD
Stanford University 

By administering an entire set of items, which are developed to 
measure all of the major achievement objectives for a course, at 
several testing periods, e.g., biweekly, during a course, and by
using an item sampling design, estimates of achievement can be 
obtained for each period (Allen, DeLay, Gorth, and Popejoy, 
1968). Teachers would find tables or histograms of the percentage 
student achievement, on a criterion pool of items or on subsets 
of the pool versus testing period, a useful description of the stu- 
dents’ achievement in the course. A general FORTRAN com- 
puter program is available to make these calculations. 


Input 
The program reads input which has been organized into a tape- 
based data bank (Gorth, Grayson, Popejoy, and Stroud, 1968) 
of data from instructional testing in a course using item sampling. 
In the data bank, both student and item are identified by number.
Each item is also classified as measuring an educational objective
of the course. The response to each item by each student is recorded.


Output 
The program allows requests for tables and histograms to be
calculated for (a) all students or subgroups of students, (b) all
questions or subgroups of questions, (c) questions selected by
classifiers, and (d) a specified sequence of testing periods. If the
sequence of testing periods uses a pretest or posttest different
from the intervening tests, either or both may be considered sepa-
rately in the tabulations. Finally, moving averages, which average
achievement across more than one testing period, may be requested.

* The research herein was performed pursuant to a grant from the Charles F.
Kettering Foundation to Dr. Dwight W. Allen, Dean, School of Education,
University of Massachusetts.

The output includes (a) a list of the parameters specified, (b)
a calendar of dates of the testing periods, (c) the requested tables of
testing period versus percentage achievement (with the number of items
used in calculating the percentages), and (d) the requested histograms
of each table.
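
The tabulation described above can be sketched briefly. This is an illustrative Python sketch, not the authors' FORTRAN routine; the function names, the (period, correct) input layout, and the two-period default window are assumptions made for the example.

```python
from collections import defaultdict


def percent_by_period(responses):
    """responses: iterable of (testing_period, correct) pairs.

    Returns {period: (percentage correct, number of items used)}.
    """
    right = defaultdict(int)
    total = defaultdict(int)
    for period, correct in responses:
        total[period] += 1
        right[period] += int(correct)
    return {p: (100.0 * right[p] / total[p], total[p]) for p in sorted(total)}


def moving_average(percents, window=2):
    """Average each period with the preceding (window - 1) periods."""
    out = []
    for i in range(len(percents)):
        chunk = percents[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def histogram(table, width=40):
    """One text bar per testing period, scaled so 100 per cent fills `width`."""
    return [f"{p:>3} {'*' * round(pct * width / 100)} {pct:.1f}% (n={n})"
            for p, (pct, n) in table.items()]


data = [(1, True), (1, False), (2, True), (2, True), (2, False), (2, True)]
table = percent_by_period(data)
print("\n".join(histogram(table)))
```

Reporting the item count beside each percentage matters under item sampling, because different periods may rest on different numbers of items.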

Additional information about the program and its availability 
may be obtained from William Gorth, SCRDT, School of Edu- 
cation, Stanford, California 94305. 
A COMPUTER PROGRAM TO EVALUATE ITEM
PERFORMANCE BY INTERNAL AND EXTERNAL
CRITERIA IN A LONGITUDINAL TESTING PROGRAM
USING ITEM SAMPLING


W. GORTH, A. GRAYSON, AND R. LINDEMAN
Stanford University 


The use of item sampling procedures to establish norms for
student achievement during a school year presents difficulties in
evaluating item performance. First, students may be asked to
answer items on which they have had instruction many months
in the past. Secondly, a student's total score on a test does not
provide an adequate criterion for evaluating the performance of
the items on the test. This situation contrasts with the usual teaching-
testing situation where the teacher administers a test measuring
only the material taught the preceding few weeks. In this case,
the students have all been exposed to the instruction within a short
period before the test and have all had an opportunity to learn
the material. The student's total score on the test is an adequate
criterion for evaluating item performance.

Lindeman, Gorth, and Allen (1968) describe an item evaluation
procedure by which an appropriate criterion score may be calcu-
lated for each student. These scores are then used in an item
analysis to estimate the difficulty and discrimination indices for
test items. The computer program described below is available for
making these calculations.


Input 
The program, which is written in FORTRAN, operates on data 
from a tape-based data bank (Gorth, Grayson, Popejoy, and 
Stroud, 1968). The item-analysis program locates a specific school 
and course in the data bank. It selects either all of the items or a 
specific subset of items upon which to base the criterion score and 
the item analysis. It selects all students or a specified group of 
students to be considered in the item analysis. A set of covariate 
measures for each student is available in the data bank, and each 
of a specified number of these external measures may be con- 
sidered as an external criterion in the item analysis or a score 
may be calculated. When criterion scores are calculated for the
sampling items, students with fewer than a minimum number of 
items may be excluded from the analysis. The program allows 
item performance to be calculated for time intervals designated 
from reference times either at each student’s completion of partic- 
ular instructional lessons measured by the item or at a specific 
calendar date. Up to five separate time intervals are specified by 
the beginning and ending number of days from the reference time 
(Lindeman, Gorth, and Allen, 1968).


Output 
The output from the program includes: 


1. A table of (a) the student identification numbers, (b) the 
student percentage items answered correctly, (c) the number of 
items included in each student’s score, and (d) the individual 
item numbers for items included in the student’s criterion score, 

2. A table of the distribution of items answered correctly and
incorrectly in 15-day intervals from the beginning to the end of
the time interval designated, and

3. A table of (a) the item identification number, (b) the number
of students to whom the item was administered, (c) the item 
difficulty and discrimination indices, and (d) the average criterion 
score and its standard deviation for those students answering the 
item correctly, for those answering it incorrectly, and for the 
combination of students answering it correctly or incorrectly. 
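
The third output table can be illustrated with a short sketch. The source does not state which discrimination index the program computes, so the point-biserial correlation used below is an assumption, as are the function name and the (correct, criterion score) input layout; difficulty is taken as the proportion answering correctly.

```python
from statistics import mean, pstdev


def item_statistics(records):
    """records: list of (correct, criterion_score) for students given the item."""
    scores = [s for _, s in records]
    right = [s for c, s in records if c]
    wrong = [s for c, s in records if not c]
    p = len(right) / len(records)          # difficulty: proportion correct
    sd = pstdev(scores)                    # SD of criterion over all students
    if right and wrong and sd > 0:
        # Point-biserial correlation (assumed index, not confirmed by the source).
        r_pb = (mean(right) - mean(wrong)) / sd * (p * (1 - p)) ** 0.5
    else:
        r_pb = float("nan")
    return {
        "n": len(records),
        "difficulty": p,
        "discrimination": r_pb,
        "mean_correct": mean(right) if right else None,
        "mean_incorrect": mean(wrong) if wrong else None,
        "mean_all": mean(scores),
        "sd_all": sd,
    }


stats = item_statistics([(True, 80), (True, 70), (False, 50), (False, 40)])
print(stats["difficulty"], round(stats["discrimination"], 3))
```

An external covariate from the data bank could be passed in place of the calculated criterion score without changing the routine.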


Additional information about the program and its availability 
may be obtained from William Gorth, SCRDT, School of Edu- 
cation, Stanford, California 94305. 
EXTENDED COMPUTER USE IN COMPILING A
FLANDERS-AMIDON INTERACTION
ANALYSIS MATRIX¹


DONALD ARY, EDWARD GOTTS, AND JOHN SHAVER
Indiana University 


[...] and in teacher training. This system classifies teacher and
student verbal behavior into ten categories. An observer records a
digit representing one of those categories every three seconds to
indicate the occurring behavior. The resulting chain of digits is
[...]

Plotting such matrices by hand has proved very time consuming
and subject to error. Rippey (1965) has described a procedure for
punching the chain of digits on cards and using a computer to
plot the matrix. This procedure greatly reduces the time and
expense involved in handling the data.

We have found the following extensions of Rippey's procedure
useful in further reducing time and expense and in producing a
[...]


1. Observers record their observations directly on IBM sheets.
An optical scanner is then used to transfer the data directly
onto cards. This avoids the expense and possible error in-
volved in key punching. Observers record the ten in the
Flanders-Amidon categories as zero on the sheets and this
is converted to ten on the computer output.


¹ The computer procedures described in this article were developed in con-
junction with the "German Language Project" of the Research and De-
velopment Office, School of Education, Indiana University.
2. The output matrix is plotted to show the proportion each 
cell contributed to the grand total, the row, and the column. 
For example, the following information was compiled for 
the 4-8 cell in one matrix we plotted using this system: 


1950 
m 062 
R 359 
C 518 


This indicates that the 4-8 interaction pair, teacher question
followed by predictable student response, occurred 1950 times
and accounted for 6.3 per cent of all interaction pairs. The
row per cent indicates that in 35.9 per cent of the cases where
4, teacher question, occurred, it was followed by 8, student
predictable response. The column per cent indicates that in 51.8
per cent of the cases where 8, student predictable response,
occurred, it was preceded by 4, teacher question. The cell,
row, or column proportions can be combined to produce
any ratios desired.
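
The tallying behind such a cell entry can be sketched as follows. This is an illustrative Python sketch rather than the authors' FORTRAN program; the function names are hypothetical, and it assumes that each successive overlapping pair of digits contributes one tally, with zero on the sheet standing for category ten.

```python
def interaction_matrix(digits):
    """Tally successive pairs of category codes into a 10 x 10 matrix."""
    cats = [10 if d == 0 else d for d in digits]   # zero on the sheet means ten
    m = [[0] * 10 for _ in range(10)]
    for first, second in zip(cats, cats[1:]):
        m[first - 1][second - 1] += 1              # row = first, column = second
    return m


def cell_report(m, row, col):
    """Count plus proportions of the grand total, the row, and the column."""
    count = m[row - 1][col - 1]
    grand = sum(sum(r) for r in m)
    row_total = sum(m[row - 1])
    col_total = sum(r[col - 1] for r in m)

    def share(part, whole):
        return part / whole if whole else 0.0

    return (count, share(count, grand),
            share(count, row_total), share(count, col_total))


matrix = interaction_matrix([4, 8, 4, 8, 5, 0, 4, 8])
print(cell_report(matrix, 4, 8))
```

In this toy chain every teacher question (4) is followed by a predictable student response (8), so the row and column proportions for the 4-8 cell are both one.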


A printout of this FORTRAN program is available from the 
authors on request.

A COMPUTERIZED APPROACH TO CALCULATING
INTERACTION ANALYSIS OBSERVATION
MATRICES AND RATIOS


EDWARD F. KRAHMER, RICHARD W. KUNKEL, JACK W. BARDEN,
A. C. LINDEM 


University of North Dakota 
AND 
MARGARET ABBOTT 
Grand Forks (North Dakota) School District 


A recent development in measuring verbal interaction which 
has shown considerable promise according to the educational lit- 
erature is interaction analysis. The use of interaction analysis is 
not limited solely to teacher supervision. The literature reports 
many applications with regard to identifying and measuring change 
in verbal interaction patterns. Utilized with attitude, achievement,
personality, and other measuring devices, interaction analysis has
potential as a criterion measure.

The remarkable growth in applications of this technique might
be even more spectacular if converting raw observation response
data into matrices and calculating the various ratios (Amidon
and Hough, 1967) were not such a time-consuming task. This
report describes a simplified procedure for the collection of data
from interaction observations and a multi-purpose computer pro-
gram for the conversion of the data into usable results.


Data Collection Procedures 


The evaluation, on a limited budget, of a USOE Title III 
Project, wherein interaction analysis observations were conducted 
on approximately two hundred teachers, necessitated a short-cut 
in data collection procedures. The procedure described depends 
upon the availability of an optical scanner (in this project an 
IBM 1230 which provided punched card output and IBM number 
555 scanner sheets were used). The 555 scanner sheet has one 
hundred rows numbered from zero through nine for responses, 
with the odd numbered rows on the left hand side of the sheet. 

The common twenty minute observation period with four hundred 
observation responses requires four scanner sheets. These were 
numbered from one to four in the upper left-hand corner and the 
backs used for such data as observer and teacher name, subject 
matter being taught, and date. Categories one through nine were 
coded the corresponding number (1 through 9), while category 
ten was coded zero (0). 

The one problem with this procedure was multiple marks in a 
column. This occurred approximately once in a thousand responses. 
If the following rules are observed, the problem of more than one 
mark per column can be further reduced. 


1. Erase completely all marking errors. Check scanner sheets at 
completion of observation period for random marks. 

2. Use a ruler or piece of paper as a guide to insure only one 
mark in a response row. 

3. Code the first fifty observation responses in the odd-num-
bered response rows on the left-hand side of the scanner
sheet and the next fifty responses in the even-numbered re- 
sponse rows. This eliminates alternating back and forth from 
left to right on the sheet. Since a complete scanner sheet 
cannot be punched on one IBM card, two cards were punched, 
one containing data from the left side of each scanner sheet 
and the other containing data from the right side. 


Computer Program 
Hardware and Software Requirements 


The program was written to be operated on an IBM 360/30 
computer with 32K memory and an E level FORTRAN IV com- 
piler. A printer, card reader, card punch, and one disk drive are 
required. It is possible to convert to tape in place of disk or to 
eliminate both of these hardware devices as described in the
manual for this program.


Data Input


Data input consists of a header card which provides data con- 
cerning the size of the problem and the output modes. The following 
limits exist in the program: 


1. Maximum of 9,999 interaction observations. 

2. Maximum of 78 observation responses on one card. 

3. Unlimited number of cards in the set for one observation 
as long as the same number of cards is used for each ob- 
servation and the total number of observation responses does 
not exceed 1,200. 


If the cards in each observation set are numbered consecutively 
from one on in columns 79-80, the program will check for missing 
or out-of-sequence data cards. 


Data Output 


Any or all of the following four outputs can be determined from 
the raw observation data and converted to printed and/or punched 
format as specified by information on the header card described
in the Data Input section.


1. The matrix for each observation with column and row totals 
and column per cents for the printed results. 

2. A summary matrix in the same format as “1” for all the 
observations. 

3. The ratios for each observation as described by Amidon
and Hough (1967), including indirect/direct, revised indirect/
direct, teacher talk, student talk, content cross, extended
indirect influence, extended direct influence, teacher response
to student comments, student talk following teacher talk,
silence or confusion, and steady-state cells.

4. The same ratios as in “3” for the summary matrix. 
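
A few of these ratios can be sketched from a 10 x 10 matrix. The exact definitions are those of Amidon and Hough (1967) and are not reproduced in this report; the category groupings below (1 to 4 indirect teacher talk, 5 to 7 direct teacher talk, 8 and 9 student talk, 10 silence or confusion) are assumed from the usual Flanders scheme, and the function name is hypothetical. The sketch is in Python, not the FORTRAN IV of the program.

```python
def basic_ratios(m):
    """m: 10 x 10 matrix of tallies, m[i][j] = category i+1 followed by j+1."""
    col = [sum(row[j] for row in m) for j in range(10)]
    total = sum(col)
    indirect = sum(col[0:4])     # assumed grouping: categories 1-4
    direct = sum(col[4:7])       # assumed grouping: categories 5-7
    return {
        "teacher_talk": sum(col[0:7]) / total,
        "student_talk": sum(col[7:9]) / total,
        "silence_or_confusion": col[9] / total,
        "indirect_direct": indirect / direct if direct else float("inf"),
        "steady_state": sum(m[i][i] for i in range(10)) / total,
    }


# Toy matrix: two 4-8 pairs, one 8-5 pair, one 5-5 pair.
toy = [[0] * 10 for _ in range(10)]
toy[3][7] = 2
toy[7][4] = 1
toy[4][4] = 1
print(basic_ratios(toy))
```

The same function applies unchanged to a summary matrix formed by adding the matrices of the individual observations.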


Summary 


This report has outlined a simplified procedure for collecting
interaction analysis observation data by using optical scanner
sheets and the IBM 1230 Optical Scanner. A computer program
is available for converting the raw observation data to matrices
and ratios for each observation, and a summary matrix and ratios.
A manual and/or card deck for this program can be obtained
by writing:
Bureau of Educational Research 
University of North Dakota 
Grand Forks, North Dakota 58201. 
COMPUTER GENERATION OF SEMANTIC
DIFFERENTIAL (SD) QUESTIONNAIRES


ROBERT B. KANE

The typical SD questionnaire consists of directions to S fol-
lowed by the first concept with its associated adjectival scales.
Realizing the possibility that progressive effects and treatment 
interactions may occur, E determines the order and polarity of 
the scales in some random fashion. Thus, while the ordering of 
the scales and their polarity remain invariant, at least the effects
are free of E's bias with respect to the variables being studied. 
Order effects among the concepts included are often “balanced” 
by presenting the concepts to Ss in several different randomly 
determined orders. The solution, then, has been to take account of,
but not necessarily minimize, proximity error within SD question-
naires.

While the weaknesses of this solution are obvious, better solu-
tions have not been economically feasible. Print shop and office
duplicating machines are not generally designed to produce variable 
formats. Moreover, how can E be sure he is reducing substantially 
these kinds of bias by having a number of different versions of 
SD questionnaire sheets used in his data collection?

Each of these objections now may be rebutted. Houston (1967) 
has reported improved methods of controlling proximity error by 
employing a Monte Carlo technique for identifying Latin squares 
whose columns yield a sequence of permutations of n items such
that proximity error may be dramatically reduced for n < 23.
Since many applications of the SD involve fewer than 23 scales
per concept and fewer than 23 concepts per administration, Hous-
ton's work has obvious import for Es using the SD.

The Program


Following a suggestion from Madison and Goodrich (1965),
a program was written for the IBM 7094 computer that generates
SD questionnaires of up to 22 concepts with up to 22 adjectival
scales per concept. This program utilizes Houston's method of
minimizing proximity error.

Three sources of proximity error may be treated by applying 
this program: (1) the presentation order of the adjectival scales 
used to measure the meaning of a concept, (2) the presentation 
order of the concepts, and (3) the polarity of each scale (which 
end is positive). Scale polarity is determined scale-by-scale by 
reference to a stream of random digits in the computer’s memory. 
Proximity errors caused by scale-presentation order and concept- 
presentation order are minimized by using the particular permuta- 
tion of n items found by Houston to generate the Latin square 
yielding the most favorable index of proximity error. 

The output is a set of SD questionnaires for as many Ss as
E desires. Each questionnaire includes a standard set of directions
as suggested by Osgood, Suci, and Tannenbaum (1957), followed
by the concepts with their associated scales. According to the
wishes of E, the scale polarities, scale orders, and concept orders
may each be held invariant or varied to minimize proximity error.
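
The three controls described above can be illustrated with a small sketch. Houston's particular permutations are not given in this report, so a plain cyclic Latin square is substituted here purely as a stand-in; the function names and the sample concepts and scales are likewise hypothetical, and the sketch is written in Python rather than for the IBM 7094.

```python
import random


def cyclic_latin_square(n):
    """Rows of a cyclic n x n Latin square (a stand-in, not Houston's squares)."""
    return [[(row + col) % n for col in range(n)] for row in range(n)]


def make_questionnaire(concepts, scales, subject, rng):
    """Return [(concept, [(left_pole, right_pole), ...]), ...] for one S."""
    concept_rows = cyclic_latin_square(len(concepts))
    scale_rows = cyclic_latin_square(len(scales))
    pages = []
    concept_order = concept_rows[subject % len(concepts)]
    for position, c in enumerate(concept_order):
        scale_order = scale_rows[(subject + position) % len(scales)]
        items = []
        for s in scale_order:
            left, right = scales[s]
            if rng.random() < 0.5:          # polarity from a random stream
                left, right = right, left
            items.append((left, right))
        pages.append((concepts[c], items))
    return pages


concepts = ["SCHOOL", "TEACHER", "HOMEWORK"]
scales = [("good", "bad"), ("strong", "weak"),
          ("active", "passive"), ("fast", "slow")]
for concept, items in make_questionnaire(concepts, scales, 0, random.Random(1)):
    print(concept, items)
```

Holding an order invariant amounts to always using the same row of the square; varying it amounts to stepping through the rows from one S to the next.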

If the number of Ss is large, the scoring of SD questionnaires
containing essentially randomly ordered items is no small task. In
such cases items may be printed on mark sense cards so that
Ss' responses may be machine-processed.
METHODS OF GENERATING RANDOM
NORMAL NUMBERS

Stoloff (1968) presented a general method using a digital
computer to generate random numbers having various specified 
distributions. Like most general methods it is not efficient when 
used for the particular case. 

His method is essentially that used for Monte Carlo integration
of a function and is described by Moshman (1967). The specified
density is inscribed in a rectangle of width x and height y.
One could estimate the percentage of the total area, xy, under the
density by generating random coordinates (x, y) and counting the
number falling below the inscribed curve. Stoloff's method is in-
verse integration in that a random coordinate is generated and if
it falls below the curve, then the x-coordinate is used as a random
deviate having the specified density. One should note that the
dimensions of the rectangle are arbitrary. If uniform random
numbers on the unit interval are being generated as inputs to this
method, then a convenient height is unity. The inscribed density
can be scaled so that its maximum ordinate is one. Consequently,
for generating normal deviates only the exponential part of the
normal curve formula need be calculated for comparison with
random heights.
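
The procedure just described can be sketched in a few lines. This is a minimal Python illustration, not Stoloff's program or the FORTRAN subroutine used in the comparison; the function name and the default half-width of three standard deviations are assumptions chosen to match a rectangle of width six and height one.

```python
import math
import random


def rejection_normal(rng, half_width=3.0):
    """Draw one (truncated) normal deviate by accepting points under the curve."""
    while True:
        x = (2.0 * rng.random() - 1.0) * half_width   # abscissa in the rectangle
        y = rng.random()                              # height on the unit interval
        if y < math.exp(-0.5 * x * x):                # density scaled to maximum 1
            return x


rng = random.Random(0)
sample = [rejection_normal(rng) for _ in range(20000)]
mean = sum(sample) / len(sample)
variance = sum((v - mean) ** 2 for v in sample) / len(sample)
print(round(mean, 3), round(variance, 3))
```

Because the abscissa never leaves the rectangle, every deviate lies within three standard deviations of the mean, which is the truncation the method imposes.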

A disadvantage of this general method is that a truncated dis-
tribution is generated since the user must specify the width of the
rectangle. Also, several numbers must be generated in order to
obtain a pair falling below the curve. For the normal curve in-
scribed in a rectangle of width six and height one, the ratio of
number of pairs discarded to the number of pairs used is about
seven to five. Therefore, about 24 uniform random numbers must
be generated for every five normal numbers obtained. In addition,
twelve exponential calculations are necessary to obtain these same
five normal numbers. If a width of eight standard deviations is
specified, the ratio increases to 11 to five.
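
The seven-to-five figure can be checked directly, assuming the density is scaled to a maximum ordinate of one and the rectangle extends three standard deviations on either side of the mean. The symbol p, for the probability that a pair is accepted, is introduced here only for illustration.

```latex
p = \frac{\int_{-3}^{3} e^{-x^{2}/2}\,dx}{6 \times 1}
  = \frac{\sqrt{2\pi}\,(0.9973)}{6}
  \approx 0.417 \approx \frac{5}{12},
\qquad
\frac{1-p}{p} \approx \frac{7}{5}.
```

With a width of eight the integral is about 2.5065, so p is about 2.5065/8 = 0.313, or roughly 5/16, and the ratio of discarded to used pairs becomes about 11 to 5.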

Stoloff scales the maximum ordinate of the normal curve at
.39894 which leads to inefficient use of the area. His program will
discard about 2.5 times more pairs than necessary with the result
that about 39 uniform random numbers must be generated in order
to obtain five normal numbers on a range of six standard devia-
tions. Furthermore, this requires an unnecessary multiplication by
.39894 for every x-coordinate generated.

Before using Monte Carlo inverse integration one should in-
vestigate whether it would be more efficient to solve for x in the
expression:

\[ F(x) = \int_{-\infty}^{x} f(t)\,dt \]

where F(x) and f(x) are the distribution and density functions,
respectively. Using Stoloff's second example, the exponential density
is:

\[ f(x) = a e^{-ax}, \qquad 0 \le x \]

so that

\[ F(x) = 1 - e^{-ax} \]

and

\[ x = -\ln[1 - F(x)]/a \]

A uniform random number on the unit interval is generated and
put equal to F(x). The required random x from this density is
obtained using the above solution. This method is quite efficient,
since no numbers are discarded and no restriction on the range of
x is necessary.
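
The solution for the exponential density translates into a one-line generator. The sketch below is in Python for illustration only; the function name and the rate a = 2 used in the check are assumptions.

```python
import math
import random


def exponential_deviate(a, rng):
    """Solve F(x) = u for x, where F(x) = 1 - exp(-a x) and u is uniform."""
    u = rng.random()                  # u lies in [0, 1), so 1 - u is never zero
    return -math.log(1.0 - u) / a


rng = random.Random(0)
draws = [exponential_deviate(2.0, rng) for _ in range(20000)]
print(round(sum(draws) / len(draws), 3))   # should be near 1/a = 0.5
```

Every uniform number yields one deviate, so nothing is discarded and the range of x is not truncated.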
One difficulty with the normal density is that no closed form of
the integral exists. However, approximations do exist, and a partic-
ularly good one is given by Burr (1967). It is

\[ x = \frac{[(1 - F)^{a} - 1]^{b} - [F^{a} - 1]^{b}}{.323968} \]

where F is the value of F(x), a = -1/6.158, b = 1/4.874.
Use of this approximation, however, may not be efficient as
[...] since four roots are required for each random number.
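
The approximation can be tried numerically. The sketch below, in Python, assumes the divisor is .323968 (the printed figure is partly illegible, and this value is the one that reproduces standard normal deviates) together with the exponents a = -1/6.158 and b = 1/4.874; the function name is hypothetical, and the standard library's `NormalDist.inv_cdf` is used only as a reference value.

```python
from statistics import NormalDist

A = -1.0 / 6.158
B = 1.0 / 4.874
DIVISOR = 0.323968          # assumed reading of the printed constant


def burr_normal_deviate(F):
    """Approximate normal deviate for a cumulative probability 0 < F < 1."""
    upper = ((1.0 - F) ** A - 1.0) ** B     # two of the four roots
    lower = (F ** A - 1.0) ** B             # the other two
    return (upper - lower) / DIVISOR


for F in (0.1, 0.5, 0.9, 0.975):
    print(F, round(burr_normal_deviate(F), 4),
          round(NormalDist().inv_cdf(F), 4))
```

To generate random numbers, F would be replaced by a uniform random number strictly between zero and one.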

For generating normal numbers a third method is to use the sum
of a number of uniform random numbers (usually twelve). This
method requires no calculation of roots or exponentials and is
[...] program. The effective range is plus and minus six standard
deviations.
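
With twelve uniform numbers the sum has mean six and variance one, so subtracting six gives an approximately normal deviate. A minimal Python sketch follows; the function name is an assumption.

```python
import random


def sum12_normal(rng):
    """Sum of twelve uniform numbers on the unit interval, centered at zero."""
    return sum(rng.random() for _ in range(12)) - 6.0


rng = random.Random(0)
values = [sum12_normal(rng) for _ in range(20000)]
mean = sum(values) / len(values)
variance = sum((v - mean) ** 2 for v in values) / len(values)
print(round(mean, 3), round(variance, 3))
```

Each uniform number has variance 1/12, which is why twelve of them are the usual choice: no rescaling is needed.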


Marsaglia and Bray (1964) present yet another method for
generating random normal numbers. It is the most efficient method
we have found, but it does not seem to be well known. Briefly,
the method consists of expressing the normal density as a mixture
of four sub-distributions, two of which are very easy to evaluate.

Let f(x) represent the normal density. Then

\[ f(x) = .8638\, g_1(x) + .1107\, g_2(x) + a\, g_3(x) + b\, g_4(x) \]

where the g_i(x) are the sub-distributions and

\[ a = .0228002039, \qquad b = .0026997961 \]

The expressions for g_3(x) and g_4(x) are somewhat complicated,
but g_1(x) and g_2(x) are simply the distributions of the sum of
three and two uniform numbers on the unit interval, respectively.
The densities are scaled so they both have means of zero and
variances of one and three-eighths, respectively.

With probability .8638 use g_1(x) by computing

\[ x = 2(\mu_1 + \mu_2 + \mu_3 - 1.5) \]

With probability .1107 use g_2(x) by computing

\[ x = 1.5(\mu_1 + \mu_2 - 1) \]

where the \mu_i are uniform on the unit interval.
With probability a, use g_3(x), where

\[ g_3(x) = \begin{cases}
\gamma - 4.73570326(3 - x^2) - 2.15787544(1.5 - |x|), & |x| < 1 \\
\gamma - 2.36785163(3 - |x|)^2 - 2.15787544(1.5 - |x|), & 1 < |x| < 1.5 \\
\gamma - 2.36785163(3 - |x|)^2, & 1.5 < |x| < 3
\end{cases} \]

\[ \gamma = 17.49731196\, e^{-x^2/2} \]

This is accomplished by obtaining a random pair, (x, y), by
drawing two uniform random numbers on the unit interval, (\mu_1,
\mu_2), and calculating

\[ x = 6\mu_1 - 3, \qquad y = .358\mu_2 \]

Then calculate g_3(x). If y < g_3(x), the normal number is x.
If y > g_3(x) repeat the procedure by obtaining a new (x, y).
With probability b, use g_4(x) by forming a pair, (x, y),
according to:

\[ x = v_1 t, \qquad y = v_2 t \]

where

\[ t = [(9 - 2 \ln s)/s]^{1/2}, \qquad s = v_1^2 + v_2^2 \]

and (v_1, v_2) are two uniform numbers on the interval -1 to +1,
with the constraint that s < 1. If either x or y is greater than three
in absolute value, then that number is used as the random normal
number. If neither x nor y is greater than three in absolute value,
repeat the procedure.

This method appears complicated and indeed, it requires some- 
what involved programming. The advantage of this mixture method 
is that most of the time (97.45%) only three or fewer uniform 
numbers are needed to obtain a normal number (in addition 
to the one needed to make the decision relative to g_i(x)), and
only addition and subtraction is required. Furthermore, the re- 
sulting density is exact with no truncation of range. 
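
The mixture method can be sketched compactly. The following Python sketch is an illustration, not the FORTRAN subroutine that was timed; the function names are hypothetical, the constants are those quoted in the text with the leading coefficient read as 17.49731196, and the tail branch accepts a coordinate when its absolute value exceeds three.

```python
import math
import random

P1, P2 = 0.8638, 0.1107
P3 = 0.0228002039                 # remaining probability .0026997961 is the tail


def g3(x):
    """Residual density used with probability P3 (zero outside |x| < 3)."""
    ax = abs(x)
    if ax >= 3.0:
        return 0.0
    g = 17.49731196 * math.exp(-0.5 * x * x)
    if ax < 1.0:
        return g - 4.73570326 * (3.0 - x * x) - 2.15787544 * (1.5 - ax)
    if ax < 1.5:
        return g - 2.36785163 * (3.0 - ax) ** 2 - 2.15787544 * (1.5 - ax)
    return g - 2.36785163 * (3.0 - ax) ** 2


def mixture_normal(rng):
    u = rng.random()              # decides which sub-distribution to use
    if u < P1:                    # sum of three uniforms, variance one
        return 2.0 * (rng.random() + rng.random() + rng.random() - 1.5)
    if u < P1 + P2:               # sum of two uniforms, variance three-eighths
        return 1.5 * (rng.random() + rng.random() - 1.0)
    if u < P1 + P2 + P3:          # rejection under g3, heights up to .358
        while True:
            x = 6.0 * rng.random() - 3.0
            y = 0.358 * rng.random()
            if y < g3(x):
                return x
    while True:                   # tail beyond three standard deviations
        v1 = 2.0 * rng.random() - 1.0
        v2 = 2.0 * rng.random() - 1.0
        s = v1 * v1 + v2 * v2
        if s >= 1.0 or s == 0.0:
            continue
        t = math.sqrt((9.0 - 2.0 * math.log(s)) / s)
        x, y = v1 * t, v2 * t
        if abs(x) > 3.0:
            return x
        if abs(y) > 3.0:
            return y


rng = random.Random(12345)
numbers = [mixture_normal(rng) for _ in range(50000)]
mean = sum(numbers) / len(numbers)
variance = sum((v - mean) ** 2 for v in numbers) / len(numbers)
print(round(mean, 4), round(variance, 4))
```

The first two branches together are taken 97.45 per cent of the time, which is where the method gains its speed.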


Empirical Comparison of Methods

Each of the four methods outlined above was programmed in
standard FORTRAN and run on an IBM 360, model 50. Each
method was written as a subroutine in a larger program which
calls the particular method being used, groups the resulting num-
bers in a frequency distribution, calculates the first four moments,
and outputs results. Each method was separately compiled together
with the main program and run four times: once for compiling
alone, then for compiling plus generation of 1000 normal numbers,
then for 10,000 and 90,000 normal numbers. Running times were
estimated by using the billing times recorded by the S.U. Com-
puting Center which includes compiling time, but which does not
include output time. By subtracting the compiling time, an estimate
of real computing time is obtained.


TABLE 1
Running Times in Minutes*

Method          Compile      Numbers Generated
                             1000     10,000    90,000
A                1.27        1.29      1.55      4.07
  time/1000                  .020      .026      .031
B                1.19        1.25      1.92      8.21
  time/1000                  .060      .078      .078
C                1.13        1.22      1.95      8.49
  time/1000                  .090      .082
D                1.16        1.21      1.93      8.66

* All times include compile time and grouping time.

Method A: mixture of sub-distributions.
       B: random inverse integration (after Stoloff).
       C: sum of twelve uniform numbers.
       D: inverse integration (after Burr).

Subsequent to this study, we obtained a subroutine which directs 
the computer to print the actual time at any instant during opera- 
tion. This was used by re-running the mixture method and the 
actual time agreed with the above estimation method to within 
two hundredths of a minute. Consequently, the other methods were 
not re-run.

The table shows that the mixture method is clearly superior, 
having a rate per thousand normal numbers generated of better 
than 2.5 times as fast as the random inverse integration method 
presented by Stoloff. (An improved version of Stoloff's program
was used by scaling the maximum ordinate at one.) All four
methods resulted in empirical distributions quite close to the theo- 
retical normal distribution. The first four moments of the test 
samples of 90,000 normal numbers generated by methods A and B 
are shown below: 


Method          A          B
Mean          .0019      .0025
Variance     1.0009     1.0020
m3           -.0123     -.0035
m4            3.02       3.04
Range         9.270     [illegible]
AN INFORMATION SYSTEM FOR INDIVIDUALIZED
INSTRUCTION IN AN ELEMENTARY SCHOOL*


ROBERT J. RICHARDSON


Learning Research and Development Center
University of Pittsburgh


In conducting and studying systems for individualizing instruc-
tion a major problem is the clerical and managerial functions of
record-keeping and diagnostic data analysis. This report describes
a preliminary version of an information system developed to assist
in these tasks in the Project on Individually Prescribed Instruction
(IPI) being carried out in the Oakleaf School of the Baldwin-
Whitehall School district in suburban Pittsburgh. For the first
three years of the project, these functions were performed manually
by the teacher's aides and the teachers. With greater amounts of
data accumulating, the need for a better system was evident. After
studying the information required by the teacher to prescribe in-
structional materials and procedures, a computer-oriented system
was developed. The basic elements of the system are an IBM
1232 Optical Reader with an IBM 534 Card Punch Attachment
and several COBOL programs written for a Burroughs 5500 Com-
puter.


Individually Prescribed Instruction Project 


At the beginning of the academic year, a series of placement
tests are taken by each student to determine his ability in each
IPI subject. After the placement tests indicate the student's position
in the curriculum, a pretest is given. This pretest indicates the
particular skills of a unit in which the student needs further work
in order to attain a mastery criterion. On the basis of these
diagnostic tests, prescriptions are developed. These prescriptions
list the materials and procedures by which the student is to be
taught. The final exercise included in each prescription is a curric-
ulum-embedded test (CET). This test determines whether the
student needs more work in the same skill or is ready to progress
to the next skill. After progressing through the unit in this manner
until all the skills are mastered, the student takes a posttest. If
the pupil attains a passing grade in all the sections of this test,
he proceeds to the next unit and takes the pretest. If he fails any
sections of the posttest, another prescription for that unit is made.
These prescriptions continue until the student passes the posttest.
This completes the regular cycle that is composed of pretest,
prescriptions for teaching materials and CET's, and posttests. With
this cycle being repeated by each student, proceeding at his own
rate through the curriculum, Individually Prescribed Instruction
achieves part of its goal.

* The research reported herein was performed pursuant to Contract
By16-043 with the Office of Education, U. S. Department of Health,
Education, and Welfare. Contractors undertaking such research under Gov-
ernment sponsorship are encouraged to express freely their professional judg-
ment in the conduct of the research. Points of view or opinions stated do not,
therefore, necessarily represent official Office of Education policy or position.


IPI Information System 


The project to develop an information system that could be
used with IPI was divided into two parts. The first dealt with
collecting data and developing it into a computer-usable form. The
second part involved using a computer program to assemble the
data into organized records and to return the information nec-
essary for further teacher decisions regarding pupil prescriptions.

In order to collect the data in computer-usable form, an IBM
1232 optical reader with an IBM 534 Card Punch is employed.
The 1232, using a special Punching Sequence and Code Modifi-
cation feature, allows the machine-readable sheets to be designed
for easy use by the clerks at Oakleaf. With this feature, the
1232 punches only numeric characters, and the data collected has
to be specially coded. At the end of each day's classes, the cards,
including pretests, work materials, CET's and posttests, are taken
to the computer center for processing with the COBOL programs.
Once-a-year information such as IQ Tests, Achievement Tests,
and Placement Tests are added to the tape when they are available.
The data cards from the optical reader are sorted and submitted
to the B-5500 computer. Before the program can process these data,
a dictionary file containing all the information on the total possible
scores on each worksheet and test is stored on the disk. An Old
Master Tape contains all previous data. Program 1 adds the new 
information from the data cards to the old information on the 
Old Master Tape making the New Master Tape. Program 2 takes 
the New Master Tape produced by Program 1 and prints the 
necessary information for the teacher to make the next prescription. 
This tape is also available for other types of analysis between 
production runs. One of the main problems of this preliminary 
system is that runs can be made only once a day, because the 
computer center is several miles from the Oakleaf School. 
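The two-program flow described above (Program 1 merging the day's cards into the master file, Program 2 reporting from the new master with the help of the dictionary file) can be illustrated with a short Python sketch. This is not the original COBOL: the record layout (student, item, score), the function names program_1 and program_2, and the report format are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, int]          # (worksheet or test name, score obtained)
Master = Dict[str, List[Record]]  # student identifier -> history of records


def program_1(old_master: Master, cards: List[Tuple[str, str, int]]) -> Master:
    """Add the day's card data to the old master, giving a new master."""
    # Copy the old master so that it is left intact, as the old tape is.
    new_master = {student: list(history) for student, history in old_master.items()}
    for student, item, score in cards:
        new_master.setdefault(student, []).append((item, score))
    return new_master


def program_2(new_master: Master, dictionary: Dict[str, int]) -> List[str]:
    """Return one report line per record, using total possible scores."""
    lines = []
    for student in sorted(new_master):
        for item, score in new_master[student]:
            possible = dictionary[item]  # total possible score for this item
            percent = 100 * score // possible
            lines.append(f"{student} {item} {score}/{possible} {percent}%")
    return lines
```

Keeping the old master unchanged mirrors the tape practice in the text, where the Old Master Tape survives each run and the New Master Tape is written separately.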


Future Plans 


With the experience gained from this initial system, a new 
project has been started to overcome this delay and to add further 
features to the system to facilitate school operation, the revision 
of inefficient teaching materials, research on the instructional pro- 
cess, and evaluation of learning outcomes. The new system's basic 
elements are an IBM 360/50, an IBM 1232 Optical Reader with
IBM 534 Card Punch, and an IBM 1050 Terminal with card
reader. Using the old data recording forms and Optical Reader,
the 80-column punched cards are made. Now, instead of waiting 
for results until the next day, the cards are run through the 
1050 card reader which is directly connected to the University of 
Pittsburgh’s 360/50. A Basic Assembler Language program then 
assembles records on disk for each student. The disk allows im- 
mediate access of information from the 1050 terminal located at 
Oakleaf. This allows the teachers to obtain immediate answers 
to their requests. Since the disk has a limited capacity, the seldom
used information will be stored on tape and the tape will be 
updated as needed. Currently under study is the possible develop- 
ment of programs which may suggest prescriptions to the teacher 
which he can accept or amend.


PL/I PROGRAM TO SCORE THE EDWARDS PERSONAL
PREFERENCE SCHEDULE (EPPS) 


M. I. CHARLES E. WOODSON 
University of California, Los Angeles 


The program described here for scoring the Edwards Personal
Preference Schedule (EPPS) is written in modular form to facilitate
modification. The scoring subroutine (EPPS) can be used with
other main routines or modifications of the simple main routine
provided. The main routine provided reads cards punched in a
standard format, scores responses, converts these scores to
percentiles for any of the four populations provided in the manual,
and prints and punches the results. Scoring is handled as in the
instruction manual (Edwards, 1959) with the exception that a
single additional score (MIS), the number of items to which no
response was made, is added.

The program is written in PL/I for the IBM 360/75 and copies
of the source deck, listing, and more detailed instructions are
available from the author.


REFERENCE 
Edwards, A. L. Edwards Personal Preference Schedule Manual. 
New York: Psychological Corporation, 1959.


Alexander W. Astin. The College Environment. Washington, D.C.:
American Council on Education, 1968. Pp. xi + 187. $3.00
(paperback).

A wealth of data is reported in this monograph from Astin’s

tive analyses of the Inventory of College Activities (ICA), 
an instrument completed by 30,570 freshman “observers of the
college environment” in 1961-62 on 246 American campuses. The
ICA, administered under the auspices of the National Merit Scholarship
Corporation as a mail questionnaire, solicited descriptions of
aspects of the college environment (peer, classroom, administrative,
and physical) as well as the college “image” and certain
characteristics of the respondents.
The general plan of analysis involved computing mean scores
for each institution on each item, factoring these institutional
scores separately for each of the sets of items and then correlating
each of the resulting factor scores with each other and a variety of
other institutional indices. In this way 35 factors were identified,
15 descriptive of the peer environment, 6 of the classroom environment,
4 of the administrative or disciplinary environment, 2 of
the physical environment, and 8 of the college image.

Typical of the use to which Astin has put these factors is the
description of different types of institutions. Differentiating schools


by type of curriculum: universities evidenced highly competitive


S independent, conscientious but passive in their classroom per- 
n ting little 


as fostering academic rivalry and were verbally aggressive
in the classroom. In a similar vein descriptions are afforded of the
schools subdivided by type of control, geographic location, race
teaching, counseling, and educational research. These relationships,
complexly intertwined, are a challenge to anyone concerned with
the college student and his educational milieu, whether broadly
or narrowly defined. That the educational researcher can profitably
use the kinds of assessments afforded by the ICA to give substance
(or the lie) to the well-entrenched folklore about college
environments is nowhere better illustrated than by Astin’s own
well-executed study of measures of institutional quality as predictors
of student achievement (Science, 1968, 161, 661-668). Controlling
for the varying quality of student input Astin found institutional
excellence only minimally related to the tested achievement of
graduating seniors, underlining the need, established in the
monograph, for a better analysis of college characteristics vis-à-vis
their dependence on student characteristics. Possibly the only
effective way for some colleges to change their images is not through
any administrative or educational revision but through entirely
new criteria for student selection.


CLIFFORD E. LUNNEBORG
University of Washington


Milton L. Blum and James C. Naylor. Industrial Psychology:
Its Theoretical and Social Foundations (3rd edition). New 
York: Harper and Row, 1968. Pp. xii + 633. $9.95. 


This is a big book, 633 pages, and it has a lot in it, perhaps
too much. It tries to cover all of industrial psychology, seen as
the application of psychological knowledge and method to the
human problems in industry, and this is simply too much for a
college text to do. The book contains 20 chapters covering selection
and placement (6 chapters); performance measurement (1 chapter);
training (1 chapter); motivation, satisfaction, and morale (5 chapters);
leadership and supervision (1 chapter); decision making
(1 chapter); organization (1 chapter); job analysis and evaluation
(1 chapter); accidents, safety, and fatigue (1 chapter); work
environment (1 chapter); and human performance (1 chapter). The
book is a much changed revision of Industrial Psychology and
Its Social Foundations, originally published in 1949 and revised
undergraduate and graduate courses, it will be challenging and will
substantially stand on its own. If it is to be used in courses for
business men or industrial supervisors, as texts in industrial
psychology occasionally are, it is probably too technical and too
difficult. The authors themselves say that their level of presentation
is “more sophisticated” than earlier editions. They are
right.

As a text, the book leaves some things to be desired. Its didactic
effectiveness is limited. Neither in its organization nor in the
writing of the text are instructional objectives specified, nor is
there much apparent attempt to arrange the written text to achieve
instructional goals. It seems clear that the authors see their book
as a compendium of the essential knowledge in industrial
psychology from which the instructor and his students may choose
to make up their study. The book will be rather tough sledding
for the student who, for whatever reason, may go it on his own.
Often the ideas offered are sketchy and incompletely developed.
The student will usually need the assistance of a very competent
instructor.

The book does not characteristically set up problems or problem
areas at the beginning of a chapter and then develop the pattern
of thought and the techniques that the practicing psychologist
might follow in solving these problems as a scientifically trained
professional. Nor are there chapter summaries which help the
student to review in a systematic manner. Rather, the materials
in the book tend to be laid out end to end without much relation
to central themes and problems which might have instructional
value. The references are extensive and up to date and of
considerable help to the reader who wishes to look further and deeper.
The book's instructional effectiveness is further limited by a drab
writing style which, at its worst, verges on impenetrability. This
failing is not unusual in psychology textbooks which surely must
be among the worst written books found anywhere. To the degree
that these books are used as teaching devices, drab and unclear
writing is a serious handicap. Such books tend to discourage
students and the writing interferes with the effective communication
of knowledge. Such is the case with this book where often drab
and difficult writing interferes with the extensive, well researched
information which it contains.

In their preface the authors say they wish to emphasize, as they
had not in previous editions, “emerging theory development in
industrial psychology.” They recognize that the theoretical lacunae
in industrial psychology are great and that their effort may have
only limited success. And they are right.

It would not be possible to describe all their ventures into
theory. The one in motivation is the most impressive. They
concern themselves particularly with the theorizing of Maslow and
Vroom and make rather good sense out of a difficult and confused
area. It would have been more to my liking had they integrated
the chapters on motivation, job satisfaction, and morale much
more closely. Certainly, motivation belongs in the same treatment
with job satisfaction and morale and a more unified treatment
would have led to a more satisfactory presentation.

One of the less successful bouts with theory occurs in the chapter
on attitude measurement. It begins with a brief and sketchy
treatment of attitude theory. In an effort to develop a theoretical
base for later discussions on attitude change the authors take us
on a brief tour of attitude change theories. They choose the right
theories for exploration, but the tour is so short and the research
so barely described that one is confused rather than enlightened.
Such sketchy excursions into rather complex theory occur too often
in this book. Theoretical analysis is valuable in teaching and in
practice and adds greatly to this book, but the occasional failure
to explore the theories thoroughly and to integrate them into
the discussions of practical problems must be confusing to relatively
unsophisticated students and other readers of this book.

There are some very impressive things in this book. The several
chapters on selection and testing, along with related chapters on
job analysis and performance appraisal, offer an amazingly complete
and technically proficient treatment of these topics. Although
the reason for such an extensive treatment in a text like this is
not clear, one must admire this work, including the rather detailed
tables and illustrative materials that are presented here.

The sections on organization and leadership are likewise impressive,
though much briefer than the section on selection. Such
chapters, along with those on decision-making and human performance,
introduce the student to some of the most productive
and stimulating ideas in industrial psychology and help prepare
him for more advanced courses and reading. This reviewer would
like to have seen more emphasis on these topics and on training
and a better integration of these chapters with the other more
extensively treated topics.

There is one further comment to make on this book which bears
the publication date, 1968. That comment is to the effect that this
book, like almost all others of its class, pays little attention to
some serious problems and areas of research which are so much
with us in this time and which are clearly within the realm of
industrial psychology. Among these problems are those of the
vocational training of disadvantaged persons and hardcore
unemployed. Surely industrial psychology, a science and profession
which is concerned with performance measurement, industrial training,
attitudes and motivation, etc., must have a great deal it can
say and do about this problem. Industrial psychology should also
be concerned with the employment of women, racial discrimination
in employment, vocational training in schools and other
institutions, and the development and effective use of human
resources in this nation and in underdeveloped countries. Yet this
book, as well as others on its subject, says almost nothing about
these problems. It is unfortunate to fault a good and thorough
book for omissions but it seems long past time that industrial
psychologists in their practice, and in the writings concerning
that practice, confront the major problems of their civilization.


Howard G. MILLER 
North Carolina State University 


Charles M. Bonjean, Richard J. Hill, and S. Dale McLemore. 
Sociological Measurement: An Inventory of Scales and Indices. 
San Francisco: Chandler, 1967. Pp. xiv + 580. $12.00.


The Bonjean, Hill, and McLemore volume consists of a 500- 
page classified inventory of scales and indices used in sociological 
measurement. In view of the difficulty of making a clear distinction
between a scale (unidimensional measure) and an index (multi- 
dimensional measure), the authors have classified them together 
without distinction, and both will be subsumed under “scale” in 
this review. 

The primary classification of the scales is by conceptual class 
(achievement, leadership, social participation, etc.). Each of these 
78 conceptual classes is subdivided into topical categories, and 
under each category is listed the bibliographic references of all 
scales relevant to that category. Where a scale has been used as a
measure of more than one concept or category, it is listed or
cross-indexed under each relevant rubric. Of the 2080 scales,
7 were used or cited more than five times, and a discussion is 
presented for these scales. Some of the scales are available
commercially, and publishers’ addresses are given in these cases.

Bonjean, Hill, and McLemore compiled the list of scales from
those used or cited during the twelve-year period 1954-1965
in articles and notes published in the American Journal of Sociology,
the American Sociological Review, Social Forces, and Sociometry,
which were taken to be representative of the main research
in American sociology. The bibliographic reference of a
scale used in an earlier period or published elsewhere is given
for the case in which the scale was cited (but not used) in
these journals during the period reviewed.

One purpose of the project was the determination of the extent
of continuity in sociological measurement. The authors interpret
their results as indicating a lack of continuity, citing as evidence
the finding that only about one-fourth of the 2080 scales were
used more than once, with only one-tenth of these used more 
than five times. The authors do not mention a pervasive methodo- 
logical continuity: virtually all of the measures involve verbal 
techniques based largely on interviews, ratings, and question- 
naires. Sociological measurement based on behavioral observation, 
genetic analysis, physiological functioning, etc., has not yet come 
of age. 

A second purpose was the publication of an inventory of scales 
as an aid to the researcher who desires a list of scales which have 
been used previously in a specific area. This purpose has been 
fulfilled so successfully by the authors that the compilation may 
be considered a basic reference source in (largely American) socio- 
logical measurement. When supplemented with the Science Citation 
Index or similar work, the Bonjean, Hill, and McLemore bibli- 
ography will remain an essential source for many years: any item 
in that bibliography may be used in the future as an entry in 
Science Citation Index to determine whether any scale in the 
Bonjean, Hill, and McLemore bibliography has been used or cited 
after the publication of that bibliography. Its usefulness as a 
reference book is enhanced further by one index of names and 
another of topics. 

Prepared primarily for the field of sociology, the Bonjean, Hill, 
and McLemore book is applicable also to certain areas of edu- 
cation (particularly sociological) and psychology (particularly per- 
sonality and social). 


EDWARD LEVONIAN
University of California, Los Angeles 


Walter Buckley (Ed.). Modern Systems Research for the Behavioral
Scientist: A Sourcebook. Chicago: Aldine Publishing
Company, 1968. Pp. xxv + 525. $14.75.


In his enlightened and entertaining Foreword, Anatol Rapoport 
defines a system as a “whole by virtue of the interdependence of 
its parts.” In turn, general systems theory “seeks to classify systems 
by the way their components are organized (interrelated) and to 
derive the ‘laws’, or typical patterns of behavior” (p. xvii) associ- 
ated with identifiable classes of systems. The impetus for this 
general theory apparently arose as a reaction to the failure
of mechanistic, physical-analytical approaches to deal with problems
in biology, problems which seemed to arise primarily as a
function of the utter complexity of living organisms. The analytic
methods of classical physics simply have not (according to Rapoport)
been able to deal with the complexities of even the simplest
living organisms, not to mention their inappropriateness for human
behavior analysis and complexes of living organisms such as
families, institutions, communities or nations. The methods of 
systems theory seek to facilitate prediction and control of the 
behavior of complex organisms and systems theory, most generally, 
seeks to put perspective on various levels of theoretical discourse. 

As presently conceived, by Buckley at least, modern systems 
research subsumes cybernetics, information theory, game theory, 
decision theory and something more narrowly defined as systems 
theory itself. It is obvious that this brief review can only hope 
to describe Buckley’s contribution in collecting these papers in one 
volume; we cannot comment on separate papers other than to note 
that their individual merits justify the collection independent of 
their contributions to general systems theory. 

This long (525 double-columned pages) book is a compilation 
of 59 papers which have been selected from a wide variety of 
periodicals and books; their publication dates range from 1939 
to 1968. Some of the chapters are primarily philosophical, others 
are quite technical and, therefore, may appeal to specialized in- 
terests of the reader, although none of the papers involve excessive 
mathematical complexity. The contributors include such familiar 
names in the behavioral sciences as Norbert Wiener, John von
Neumann, Ross Ashby, George Miller, Anatol Rapoport, Gordon 
Allport, O. H. Mowrer, and Kurt Lewin among others. Absent 
are the efforts of ecologists whose formulations seem to embody the 
“general systems approach” or psychologists who are primarily 
interested in education and instructional processes. Chapters are 
arranged in seven Parts labelled: (I) General Systems Research:
Overview; (II) Parts, Wholes, and Levels of Integration; (III)
Systems, Organization, and the Logic of Relations; (IV) Information,
Communication, and Meaning; (V) Cybernetics: Purpose,
Self-Regulation, and Self-Direction; (VI) Self-Regulation and Self-Direction
in Psychological Systems; and (VII) Self-Regulation
and Self-Direction in Sociocultural Systems. 

In assembling these papers into a single volume, the editor 
permits the reader to perceive many commonalities among a 
diversity of topics. Surely one of the central aims, perhaps the
central aim, of systems theory is to show common concerns existing 
among diverse scientific studies, and therefore Buckley must be 
congratulated for his juxtaposition of subject-matter as well as 
for the more obvious physical task of editing. 

By way of criticism, we would have preferred a more compre- 
hensive index as well as a greater attempt to correlate the respective 
papers rather than leave to the reader the task of generating his 
own general systems theory. Also it might have been useful to
describe similarities and differences among certain of the most
frequently used technical terms. At a broader level, we regret
that there are no "systems" papers explicitly concerned with education
and instructional processes; but this lack is hardly the
editor's fault; the papers we seek have just not been written!

In conclusion, we commend this volume to the attention of all 
who are concerned with the scientific analysis of highly complex 
organisms having to do with man and society. Behavioral scien- 
tists might wonder whether this general theory offers much which 
is greatly different from general “gestalt” approaches to psychology 
or even a religious belief in pantheism but Buckley’s volume will 
help each to make up his own mind. 


ROBERT M. PRUZEK
State University of New York at Albany 


Norman E. Gronlund. Constructing Achievement Tests. Englewood
Cliffs, N. J.: Prentice-Hall, 1968. Pp. ix + 118. $2.25 (paperback).
Norman E. Gronlund. Readings in Measurement and Evaluation. 
New York: Macmillan, 1968. Pp. xv + 463. $4.50 (paperback). 


These two books have two things in common. First, and ob-
viously, they have the same author, and, second, they possess 
overlapping content. Because of these two facts they are being 
reviewed together. The first volume as the name implies deals 
exclusively with the subject of constructing achievement tests. 
Most other textbooks in measurement and evaluation have some 
chapters devoted to this topic and there are several books treating 
this subject exclusively. The essential merit of Gronlund’s book, as 
a textbook, is that it is not simply a book about the subject of 
achievement test construction, rather, it takes the careful reader 
step by step through the process of constructing many types of 
achievement test questions. 

The book begins with a chapter which emphasizes the importance 
of preparing a test which can be an aid to learning. There is a 
significantly helpful table on page 6 which illustrates the prep- 
aration of a table of specifications. With the use of such a table 
a teacher is not apt to prepare a test asking for the recall or
recognition of facts alone and fail to include questions of under- 
standing and application of facts and principles. This table of 
specifications is repeated in more detail in chapter two, planning 
the test. Here five steps of planning are emphasized. The first two 
steps are identifying and defining the learning outcomes in be- 
havioral terms. The third is outlining the subject matter content 
to be measured by the test. The fourth is preparing a table of 
specifications and fifth, following the table in preparing questions. 

The next three chapters take the reader through the process of
constructing the most commonly used objective and the essay
type test questions. The major resources are the two taxonomy
volumes, by Bloom and others (1956) and Krathwohl and others
(1964). In these three chapters the author begins with the cognitive
and the affective objectives and illustrates with the subject
content of this book how to combine content and the taxonomy
major subheadings into test questions. Reasons are presented for
a particular type of test question being most appropriate in appraising
the attainment of certain types of objectives. The categories
of objectives, comprehension, application, analysis, synthesis,
and evaluation can be recognized as coming from Bloom's
Taxonomy. An abundance of excellent test questions results.
Occasionally a poor question is inserted to illustrate the type of
statement to avoid. This is followed by a contrasting better question
or item.

The three concluding chapters deal with assembling, administering,
and evaluating the test; statistical treatment of test results;
and the major criteria of a test: validity and reliability.
The statistical treatment is brief and simple. The more complicated 
grouping of scores into a frequency distribution is avoided so it 
is possible in fourteen pages to explain and illustrate to the 
student-reader how to compute a test’s reliability coefficient, de- 
termine the error of measurement, and assign stanines to pupils’ 
scores.
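The three computations named above can be illustrated briefly. The sketch below is not taken from Gronlund's book: it assumes the Kuder-Richardson formula 21 as the reliability estimate, uses the population variance of the scores, and assigns stanines from midpoint percentile ranks with the conventional 4-7-12-17-20-17-12-7-4 percent bands. The function names are hypothetical.

```python
import math
from statistics import mean, pvariance
from typing import List

# Cumulative percentages that separate stanines 1 through 9.
STANINE_CUTS = [4, 11, 23, 40, 60, 77, 89, 96]


def kr21(scores: List[float], n_items: int) -> float:
    """Kuder-Richardson formula 21 (assumed here as the reliability estimate)."""
    m = mean(scores)
    var = pvariance(scores)  # population variance; must be greater than zero
    return (n_items / (n_items - 1)) * (1 - m * (n_items - m) / (n_items * var))


def standard_error(scores: List[float], reliability: float) -> float:
    """Standard error of measurement: SD times sqrt(1 - reliability)."""
    return math.sqrt(pvariance(scores)) * math.sqrt(1 - reliability)


def stanines(scores: List[float]) -> List[int]:
    """Assign a stanine to each score from its midpoint percentile rank."""
    n = len(scores)
    result = []
    for x in scores:
        below = sum(1 for y in scores if y < x)
        equal = sum(1 for y in scores if y == x)
        pct = 100 * (below + equal / 2) / n
        # Stanine is one more than the number of cut points exceeded.
        result.append(1 + sum(pct > cut for cut in STANINE_CUTS))
    return result
```

With five scores of 10, 12, 14, 16, and 18 on a 20-item test, the mean is 14 and the variance 8, so formula 21 gives (20/19)(1 - 84/160) = .50 and a standard error of 2 score points.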

The last chapter, which treats validity and reliability, is
obviously necessary for one to know, if he is to construct excellent
achievement tests, but there is nothing new in the chapter. It
restates the Standards for Educational and Psychological Tests
(1966).

Each chapter closes with a well selected list of references. The 
reader may notice considerable duplication of these references 
from chapter to chapter. Ahmann and Glock (1967), Ebel (1965), 
Gronlund (1965), Stanley (1964), and Wood (1960) are repeated 
in practically all of the additional reading lists. One could logi- 
cally conclude that Ebel’s and Wood’s books on achievement testing 
are considered to be the other good books on this subject. The 
three other references are all standard textbooks in the measure- 
ment field and when any one of the three is used as a basic text 
it may well be supplemented by Gronlund’s Constructing Achievement
Tests in order to provide emphasis upon this topic.

The second and larger of the two books being reviewed contains 
453 pages of text, forty-eight different readings classified under 
eight topics appropriate to the total measurement field. The first 
four of these: The Measurement and Evaluation Process; Con- 
structing Classroom Tests; Interpreting Test Scores and Norms; 
and Validity and Reliability contain twenty-three readings which 
elaborate upon the contents of Constructing Achievement Tests.
The other four general topics are: Selecting Standardized Tests;
Standardized Testing; Using the Results of Measurement; and
Trends, New Developments, and Current Issues. These four general
topics have no close counterpart in the achievement testing book.
However, as a book of readings with such a wide coverage of
measurement topics, it can be employed effectively to supplement
almost any of the twenty basic textbooks listed immediately
following the table of contents. If Constructing Achievement Tests
is selected as the basic textbook for a college course, the contents
of the first four parts of the book of readings correlate very nicely
with the eight chapters of the test construction volume. The articles 
appear to have been selected and placed into the book specifically 
to enhance this relationship. 

All of the readings are extremely well selected and are authored 
by professionals in the measurement and evaluation field. Among
the authors are the following well-known names: D. R. Krath- 
wohl, L. J. Cronbach, R. W. Tyler, R. L. Ebel, R. L. Thorndike 
and A. Anastasi. 

A book of readings is assembled and published for the purpose 
of supplementing the basic textbook of a course. As such it is 
presumed to possess three characteristics. First, each reading 
should be related to some significant topic or concept covered by 
the textbook. Second, each reading should treat the topic to which 
it is related in somewhat greater depth than the basic textbook 
is able to do. Third, each reading should be authored by an
authority or specialist in the topic covered. Does this book of
readings possess these three characteristics? In the reviewer's opinion
it gets an A grade on all three of these desired characteristics.


Harry Kaufmann. Introduction to the Study of Human Behavior.
Philadelphia: Saunders, 1968. Pp. viii + 162. $2.50 (paper- 
back). 


This book is more than just another introductory text to the 
study of human behavior. It reflects a philosophy, an attitude 
about what psychology is and, even more important pedagogically, 
what psychology can offer to the intelligent student to enrich his
life.

This book is written by a psychologist who obviously has been 
unhappy with respect to the “food” we have been feeding people 
who are about to enter the field of behavioral science or those 
who have already entered the field and are debating whether they 
wish to remain. Behavioral scientists as a group seem to have a 
knack for making the most complex and fascinating being on 
earth (man) into a dry, forbidding, and remote creature. The 
author has achieved his purpose as stated in the preface: “The 
book tries to be many things to many people. The risk is great; it 
is difficult to avoid both the abstruse language of narrow speciali- 
zation and the dreary glibness of the popularizer; to walk the 
razor's edge, being neither dry nor vapid, allowing the reader
glances at vast erudition gracefully borne, and flavoring urbane
solemnity with dashes of elegant, but not frivolous wit” (page iv).

The author has the flowing and stimulating style of Henry James 
coupled with the training of a psychologist. His prose is so excellent
that one almost forgets that one is learning basic scientific prin- 
ciples simultaneously. Through literary style and appropriate and 
interesting examples, he leads the reader through such usually 
abstract and cold issues as independent and dependent variables, 
hypothetical constructs, research design, Type I and II errors and 
the like. Sections of the book also discuss basic philosophical con- 
cepts presented in a serial fashion and gradually developing from 
a few basic thoughts to a full-blown scientific and theoretical 
approach to the study of man.

One is sorely tempted to use this book as the basic text in 
psychology in spite of its relative brevity and the lack of the
usual plethora of rat, monkey and human studies. Actually, if one
were to confront most behavioral scientists and ask them to 
describe the very essence of psychology, they would probably reduce 
the presentation to what is available in this book. However, prob- 
ably the most appropriate suggestion is that this book be used 
in conjunction with a heavily content-oriented text. In this manner,
if the professor ever wonders just what students received from 
his course (since an appreciation and growing love of the scientific 
method is rather difficult to measure) he can always rely on what 
students have memorized from the other text. 

The book is strongly recommended to readers of EDUCATIONAL 
AND PSYCHOLOGICAL MEASUREMENT since it fosters a true under- 
standing and appreciation of the necessity of measurement as 
well as the fact that it provides information on the basic "tools" 
of measurement. Wherever this book is used it can be depended 
upon to stimulate interest and questioning on the part of the 
reader. Professors whose discussion groups seem to be lagging 
should seriously consider adding this text to their course. So many 
psychology texts are of such a nature that the only questions 
arising from their reading are ones of misunderstanding. This 
text should provide food for a provocative discussion of some of 
the most basic elements of which the behavioral sciences are 
composed, 

PHILIP S. VERY 
Rhode Island College 


Roger E. Kirk. Experimental Design: Procedures for the Behav- 
ioral Sciences. Belmont, Cal.: Brooks/Cole (Wadsworth), 1968. 
Pp. xii + 577. $12.50. 


Every now and then the combination of publisher and author 
produces a book that is not only intellectually satisfying but also 
imparts a kind of aesthetic pleasure. Kirk’s new book is one 
of these. It is a pleasure to read and a pleasure to handle. 

t the publisher and the author are to be compli- 
one has to begin to qualify a little. The pleasure 
the empathy of one professional with another. For 
(and the reviewer is currently using this text with an 
statistics class of some size) the appeal has not proven 
asked why, the response is modally that the book 
know more than you do”—and this may well be 
does one stand? Without hesitation one recommends 
nonmathematical, active researcher who may have 
all this once before. For the mathematically-inclined, 

little in the air, and for the uninitiated there is 
apparently some fogginess. Label yourself as you will, and function 
accordingly. 

The claim for the book is a simultaneous affirmation and denial 
of this reaction. It was intended by the author to serve as a 
reference book and as a text for a two-semester course in experi- 
mental design. On looking in a library at the catalogs of 40 insti- 
tutions all of reasonable and above-average repute, only two of 
them extended their nominated-design course over two semesters. 
This may, then, be the key—the book may well not be appropriate 
for the faster pace of a one semester course. The student-reaction 
of difficulty could perhaps be ascribed to pace, not product. 

For all this hedging, the sense of excitement on first browsing 
through the book remains. Insofar as the book has attempted to 
bridge the gap between the volumes on the statistical theory of 
experimental design and the few naive texts on the market, it has 
probably succeeded fairly well. Insofar as he has tried to select 
and illustrate designs and techniques of the greatest potential use- 
fulness to behavioral research, the author has done an excellent 
job. And insofar as any author wants to produce a work that is 
attractive to a certain audience, Kirk should find a favorable re- 
sponse among many, many researchers in the field. 

Before examining the content in some detail, one or two other 
general features of the book are worthy of comment. The same 
numerical data form the basis of the illustrative examples through- 
out the book. Adjustments of a minor nature are used for the 
different hypothetical problems, but one rapidly becomes familiar 
with the basic data and the nuances of the hypotheses that arise 
in the various design-situations become remarkably meaningful. 
A second feature is the useful bibliography at the end of each 
chapter; and with it an exceedingly valuable summary of advantages 
and disadvantages of each model. These two features provide both 
a kind of closure and a kind of horizon-opening. The closure exists 
for those who are satisfied with the included content; for those 
who are not, the suggested additional readings are uniformly ex- 
cellent. There are no problem-sets (a pedagogic disadvantage), 
though the coding system for designating each experimental de- 
sign is often enough of a problem to constitute some kind of 
substitute. 

The book consists of thirteen chapters and a seventy-page 
appendix. Appendices have always held a fascination for this re- 
viewer—they often indicate how many loose threads the author 
couldn't catch into the weave of his text. In this case the appen- 
dices are literally fun to browse through—they range from a 
rather mundane effort on the rules of summation and expectation, 
through a note on orthogonal coefficients to one of the more useful 
and interesting sets of tables one can find. For those fellow souls 
who like to start at the end of the book to see whodunit, there 
are some rewards in store. 

Chapter 1 of the main text attempts to do a great deal and 
does some of it well. The chapter is essentially a review of what 
is to be assumed by way of prior statistical sophistication—the 
vocabulary and procedures of statistical inference. The vocabulary 
part makes pretty stodgy reading (who really wants to turn to 
page 2 of any book to be confronted with an abortive Funk and 
Wagnall’s?), and the implicit philosophy of science of the first 
part of the chapter causes some wriggling-in-annoyance reaction, 
but for the most part this is a very valuable chapter. Specially 
if you have a lot of patience. The author has a penchant for 
acronyms (who would guess that YBIB-t was a Youden square 
balanced incomplete block design?) that appears on page 12 and 
sticks with us throughout and that would make the most imagina- 
tive Department of Defense official green with jealousy. No— 
chapter 1 is the kind of chapter that reads best after the rest of 
the book is finished, even though one applauds its intended pur- 
pose. 

From chapter 2 on, the effect is startlingly different. One soon 
settles down to the style of writing, which is clear, and enjoyable 
if a little dogmatic in its value statements. Chapter 2 itself is 
devoted to the basic sampling distributions, chi-square, t, and F. 
The content is excellent—given that one isn't looking for a rigorous 
mathematical presentation this chapter ranks as a high point of 
the book. The discussion of the F-distribution eases elegantly into 
the partitioning of the sum of squares and the basic fixed effects 
linear model for analysis of variance. The section on the effect 
of failure to meet assumptions of the model has been done better 
in a number of other places, but the summative effect of the 
chapter is extremely positive. 

Chapter 3 is an introduction to multiple-comparison tests, and 
maintains much of the excellence of chapter 2. The presentation 
is considerably more impactful than most explications of this 
[...] deserves special attention for the pedagogic model it 

Chapters 4-11 are each devoted to a particular kind of experi- 
mental design: completely randomized, [...] Latin square [...] 
factorial designs (3 chapters) [...] 
The ANOVA model somehow always seems to divide a student 
[...] into three groups: those who revel in it, the indifferent mass, 
and the group who tolerate it because they have to. Personally, 
[...] found the presentation a little too slick mathemati- 
cally [...] admits a considerable quantitative bias. The 
[...] sloppy mathematically—just that the simpli- 
fications often seem to be over-simplifications, and the symbols 
that are introduced to avoid the complications of subscripts a 
little strange to start with. If one could honestly divest oneself 
of a bias, these eight chapters on the basic designs must rank 
among the most useful in the field, and from empirical usage 
the proportion of students remaining in the indifferent group was 
considerably less than with any other text the reviewer has used. 
There is almost nothing one can say about the content—it is just 
what one would expect in chapters of the titles indicated, but it 
is modern, logical, clear and even exciting at times. Some ex- 
tremely enlightening little diagrams and tables accompany the 
explanations, and enhance them considerably. In short, one would 
have to search pretty hard to find an alternative text that 
handles the material so well. There is not much that is unique by 
way of substance, but it is presented most appealingly. 

The last chapter is on analysis of covariance, and the same 
excellence of presentation pertains. 

In summary, then, one must admit that the aesthetics we 
spoke of earlier probably over-rode the content in whetting our 
enthusiasm. The content is fairly ordinary, and it does cover the 
kind of intermediate ground proposed for it. But it is presented 
in such a fashion as to leave competitive texts in umbral array 
and is certainly very worth suggesting to students as a textbook 
of merit. 

PETER A. TAYLOR 
University of Manitoba 


Robert B. Miller. Statistical Concepts and Applications: A Non- 
mathematical Explanation. Chicago: Science Research Associ- 
ates, 1968. Pp. 192. $7.45 (paperback). 


This book was written for the intelligent and educated layman 
who is neither a mathematician nor a statistician but who wants 
to understand the function of experimental statistics and be able 
to interpret statistical results to make decisions and plan pro- 
grams. The author makes no attempt to develop any computational 
skills in the reader; except for the most basic descriptive measures, 
computational illustrations are rare. 

The book contains most of the topics usually found in intro- 
ductory or intermediate applied statistics books, but the presenta- 
tions are oriented toward business and engineering and the con- 
cepts and the applications of statistics are discussed only in a 
cursory manner. There is also a final section which is devoted to 
models in business and industry. 

The author's intent to keep the content brief resulted in the 
most serious weakness of the book. Several important topics, e.g., 
experimental design, normal probability curve, sampling distribu- 
tions, standard error, significance tests, and the analysis of vari- 
ance, are not discussed in sufficient depth to give the layman of 
statistics an adequate foundation for understanding these concepts 
or interpreting related statistical results. The z-score and other 
standard score transformations which are commonly used by edu- 
cators and psychologists are not discussed. No mention is made of 
either the t-test or the F-test, and the ANOVA table is not 
illustrated. These shortcomings militate against the author’s in- 
tent to enable the reader to meaningfully interpret statistical 
results. 

The discussion of certain statistical concepts is not only brief 
but frequently ambiguous. Descriptive statistics are defined as 
generalizations rather than summarizations (p. 18). Several exam- 
ples of inferential statistics are projections of trends into the 
future rather than generalizations from the sample to the popula- 
tion (pp. 18, 28). Variability is defined within the limited context 
of industry so that its meaning in the behavioral sciences is not 
readily apparent (pp. 46-47). The discussions of the platykurtic 
distribution curve (p. 53) and the interquartile range (pp. 66-67) 
are not clear. The section on confidence intervals is completely in 
error; sample statistics are utilized for parameters (pp. 138-139). 

Several other errors and/or ambiguities are frequent throughout 
the book. The construction of the histogram given on page 50 is 
incorrect. Only the maximum-likelihood estimate of the population 
standard deviation is given (p. 70) which causes a slight error 
in the formula for the standard error of the mean (p. 136). The 
analysis of variance is said to test the ratio of the variability of 
one treatment level to the combined variability of all treatment 
levels (p. 140). The predicted Y value is presented as the mean 
of the Y values for an X value (p. 91). The term “correlationship” 
is either a typographical error or an unnecessary coining of a new 
term (p. 161). The point-biserial is said to be the same as the 
biserial correlation (p. 187). 
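
The point about the standard deviation (p. 70) and the standard error of the mean (p. 136) can be made concrete. What follows is only a minimal sketch: the scores are made up for illustration, the function names are hypothetical, and nothing here is taken from the book under review. Dividing the sum of squared deviations by n gives the maximum-likelihood estimate; dividing by n - 1 gives the square root of the usual unbiased variance estimate; a standard error built on the former is too small by the factor sqrt((n - 1)/n).

```python
import math


def sd_maximum_likelihood(xs):
    """Standard deviation with divisor n (the maximum-likelihood form)."""
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / n)


def sd_sample(xs):
    """Standard deviation with divisor n - 1 (root of the unbiased variance estimate)."""
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))


def standard_error_of_mean(sd, n):
    """Standard error of the mean for a given standard deviation estimate."""
    return sd / math.sqrt(n)


# Hypothetical scores, chosen only so the arithmetic is easy to follow.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(scores)

se_ml = standard_error_of_mean(sd_maximum_likelihood(scores), n)
se_sample = standard_error_of_mean(sd_sample(scores), n)

print(f"SD (divisor n):     {sd_maximum_likelihood(scores):.4f}")
print(f"SD (divisor n - 1): {sd_sample(scores):.4f}")
print(f"SE of mean from each: {se_ml:.4f} vs {se_sample:.4f}")
```

With samples of the size common in classroom research the discrepancy is small, which is presumably why the reviewers call the resulting error "slight."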

An understanding of correlation and prediction could have been 
facilitated by discussing the z-score formulas for these statistics. 
The section on correlation could also have been improved by 
illustrating with scattergrams the various degrees of relationship, 
in addition to perfect and zero correlation, that can exist in a 
bivariate distribution. 

The organization of the book presents minor difficulties. For 
example, the sections on frequency distributions and shapes of 
distributions (pp. 48-53) should be integrated with the overview 
of descriptive statistics (pp. 26-27). In addition, the nine-page 
summary of major statistical ideas at the beginning of the book 
is redundant since the material is included verbatim at the begin- 
ning of the corresponding major sections. 

Because certain important topics have been omitted and others 
are treated superficially, this book may be useful only for those edu- 
cators and psychologists who wish to obtain a general orientation 
to the role and applications of statistics. The book may have 
greater usefulness in business, for which it is primarily intended, 
but it has not achieved its purpose of giving a sufficient under- 
standing of statistical concepts and procedures to allow a proper 
and safe interpretation of statistical results. 


GLENN H. BRACHT 

KENNETH D. HOPKINS 

Laboratory of Educational Research 
University of Colorado 


Samuel H. Osipow. Theories of Career Development. New York: 
Appleton-Century-Crofts, 1968. Pp. xi + 259. $5.75. 


The main purpose of the book is to describe and assess the 
major theories of career choice and related research. Each theory of 
career choice is fully developed and supporting research evidence 
presented. Although the author presents an evaluation of each 
theory, the research evidence is sufficient so that the reader can 
make his own judgments regarding the relative strengths and 
weaknesses of each theory. Elements common to the various theo- 
ries are identified and are synthesized with the hope that a prac- 
ticing counselor can find them useful in his work. 

Osipow suggests that career theories can be classified into four 
categories: sociological theories; self concept developmental theo- 
ries; trait-factor theories; personality-in-career theories. The spe- 
cific theories discussed in the various categories are: Roe's per- 
sonality theory of career choice; Holland’s career typology theory of 
vocational behavior; the Ginzberg, Ginsburg, Axelrod, and Herma 
theory; and Super’s developmental self-concept theory of voca- 
tional behavior. The psychoanalytic conceptions of career choice 
include those of Brill, Bordin, Nachmann, and Segal. The chapter 
entitled Personality and Career contains the following sections: 
psychological needs and careers; occupational values and careers; 
personality style and vocational behavior; psychopathology and 
careers; and personality traits and careers. The chapter entitled 
Social Systems and Career Decisions: The Situational Approach 
reviews a variety of sociological concepts that have implicit or 
explicit implications for career development. 

Each chapter usually consists of four sections. The first section 
contains the general nature and scope of the theory and its basic 
thesis. The second section reviews the results of research generated 
by the theory or that are relevant to it. The third section discusses 
the status of the theory and the implications it has for counseling. 
In the final section the theory is evaluated in respect to criteria 
pertinent to theory construction. Expectations for the future de- 
velopment of the theory are discussed in the concluding para- 
graphs. 

Osipow's explication of various theories of career development 
is excellent. His attempt at a synthesis is below par when com- 
pared to the rest of the book. Since the book is primarily con- 
cerned with matters about career development rather than with 
how to assist a person with his occupational choice, this failing 
does not seriously detract from the value of the book. It should be 
primarily used in classes concerned with occupations rather than with 
counseling. The book's utility lies in its compact presentation of a 
variety of ideas about career development. Students in counselor 
preparation programs will find this text to be an excellent intro- 
duction into the complexities of career development. 


Henry KACZKOWSKI 
University of Illinois 


A. I. Rabin (Ed.). Projective Techniques in Personality Assess- 
ment: A Modern Introduction. New York: Springer Publishing 
Company, 1968. Pp. x + 638. $11.00 


This volume, consisting of 19 original chapters by 18 contrib- 
utors, was designed to meet a need perceived by the editor for an 
introductory text on projective techniques. The book is divided 
into seven sections, with the first part devoted to the historical 
and theoretical backgrounds of projective methodologies and to 
problems in validity and assessment. Five sections are devoted to 
specific techniques, grouped according to the type of response 
elicited from the subject. Part II is devoted to inkblot association 
procedures: Rorschach (Beck) and Holtzman inkblots (Holtzman); 
Part III to constructive techniques: Thematic Apperception Test 
(Rosenwald) and varieties of thematic methods (Neuringer); Part 
IV to completion methods: word association and sentence com- 
pletion (Daston) and story completion (Lansky); Part V to ex- 
pressive methods: doll play and puppetry (Haworth) and draw- 
ings (Hammer); Part VI to extensions of the projective hypothesis: 
the Bender Gestalt (Hutt), the intelligence test in personality 
assessment (Blatt and Allison), and other projective techniques 
(Campos). The last section consists of four chapters devoted to 
applications of projective methods to clinical and research prob- 
lems. 

It is tempting to compare this recent volume to edited works 
on projective methodologies that appeared in an earlier day: 
Abt and Bellak’s Projective Psychology (1950) and Anderson and 
Anderson’s An Introduction to Projective Techniques (1951). 
The editor of the latest collection, A. I. Rabin, and Samuel J. 
Beck are the only survivors from the previous works to the current 
one. Each of the earlier works contained a chapter devoted ex- 
clusively to the Szondi Test; the current text treats the Szondi as 
just one of a variety of projective techniques and with some 
reservations (Rabin’s chapter in Anderson and Anderson also 
raised some questions about this particular technique). Missing 
from the current volume but awarded separate chapters in pre- 
vious years are such techniques as finger painting, graphology, 
and the Rosenzweig Picture-Frustration Test. Newer procedures 
not mentioned in the previous books include Holtzman inkblots, 
the Hand Test, Kahn Test of Symbol Arrangement, and a variety 
of thematic procedures (Blacky Pictures, Michigan Picture Test, 
etc.). There appears to be a certain amount of fad and fashion 
in the projective domain, but it is not clear whether shifts in 
interest are essentially due to results of validity research. 

The most significant differences between the earlier volumes and 
the present one are the strenuous efforts to relate projective method- 
ologies to nonclinical areas, such as cognition, information theory, 
perception, and motivation, and the greater concern for validity 
and the nature of validity studies. Gone are such uncritical and 
perhaps meaningless statements such as, “The ‘gorilla’ relished 
by magazine readers, when regarded from the standpoint of the 
clinical psychologist in Rorschach work rather from that of the 
magazine writer, signifies the projection of self in an effort to 
depict the baser side of the personality” (from Abt and Bellak, 
p. 84). Walter Klopfer, in his chapter in the Applications section, 
addresses himself to the problem of vague and universal state- 
ments in psychological reports based on projective techniques. 

The current concern about validity research has brought about 
an increased interest in the nature of the criterion. Several writers, 
but particularly Karon, question validity studies in which the 
criterion is only dimly understood, such as pass-fail in pilot train- 
ing, or in which the projective techniques are required to predict 
rare or unusual events, such as suicidal efforts or elopement from 
an institution, or distinguish between nosological groups, or cor- 
relate with a criterion employing a sample with markedly reduced 
variance. Certainly the results of correlating grades in graduate 
school with a selection device such as the Graduate Record Exam- 
ination should lead psychometrists to be more charitable in their 
interpretations of research on the validity of the Rorschach or any 
other projective technique. 

One of the more challenging chapters is that by Bertram P. 
Karon on the problems of validities. One of the points raised by 
Karon is that classical test theory a la Gulliksen is not only 
inappropriate for projective techniques but needs reexamination 
for psychometric purposes as well. The reader interested in test 
construction will want to examine his mathematical proof (pp. 
97-102) that zero inter-item correlations do not limit validity 
and that “reliability is largely an irrelevant consideration.” While 
this may be true for internal consistency, as Karon argues, it may 
be question-begging to assume that projective techniques are meas- 
uring temporally unstable traits and that low reliabilities of such 
measures may be interpreted as evidence for the validity of the 
instrument. Surely there are important personality characteristics 
that do not show rapid fluctuations from day to day or even 
from year to year. 
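
The claim that zero inter-item correlations need not limit validity can be illustrated with a little covariance algebra. The sketch below is not Karon's proof; it assumes standardized items, an unweighted sum score, and made-up values for the number of items and their correlations with the criterion, and the function names are hypothetical. For k mutually uncorrelated items that each correlate r with the criterion, the sum score correlates r times the square root of k with the criterion, while coefficient alpha for the same items is exactly zero.

```python
import math


def composite_validity(item_validities, item_corr):
    """Correlation between an unweighted sum of standardized items and a criterion.

    item_validities: correlation of each item with the criterion.
    item_corr: square inter-item correlation matrix (list of lists).
    """
    k = len(item_validities)
    covariance_with_criterion = sum(item_validities)
    variance_of_sum = sum(item_corr[i][j] for i in range(k) for j in range(k))
    return covariance_with_criterion / math.sqrt(variance_of_sum)


def coefficient_alpha(item_corr):
    """Coefficient alpha computed from the inter-item correlation matrix."""
    k = len(item_corr)
    total = sum(sum(row) for row in item_corr)
    diagonal = sum(item_corr[i][i] for i in range(k))
    return (k / (k - 1)) * (1 - diagonal / total)


# Assumed illustration: four uncorrelated items, each correlating .30
# with the criterion (k * r**2 <= 1, so the pattern is attainable).
k, r = 4, 0.30
uncorrelated = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]

validity = composite_validity([r] * k, uncorrelated)
alpha = coefficient_alpha(uncorrelated)

print(f"validity of the sum score: {validity:.2f}")
print(f"coefficient alpha:         {alpha:.2f}")
```

The illustration bears only on internal consistency. It says nothing about stability over time, which is the reservation raised in the paragraph above.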

For the members of the intended audience for this book, the 
student in need of an overview of the field of projective methods 
and a glimpse into the problems and applications of these methods, 
this text should serve admirably well. It is relatively free of the 
rigidity and vague clinical “insights” of previous edited works. 
For the professional in test construction and in general psychology, 
there is much material that will be of more than passing interest. 


PHILIP HIMELSTEIN 
The University of Texas, El Paso 


Robert Rosenthal and Lenore Jacobson. Pygmalion in the Class- 
room. New York: Holt, Rinehart and Winston, 1968. Pp. xi 
+ 240. $3.95 (paperback). 


The reader of this review will recognize Robert Rosenthal as 
the psychologist who, over the past few years, has been collecting 
evidence and stimulating research by others concerned with the 
effects of experimenter expectations on the results of experiments. 
Some of the evidence of such experiments, in which animals ranging 
from planaria to men have been employed as subjects, is reviewed 
in Chapter 4 of the present volume. In this book, however, the 
animals of primary interest are public school children and their 
teachers. 

Rosenthal’s general thesis is that “. . . one person’s expectation 
for another person’s behavior can quite unwittingly become a 
more accurate prediction simply for its having been made.” After 
describing observational and experimental support for this hypoth- 
esis in clinic, laboratory, and everyday contexts, the authors proceed 
to detail a study conducted at a public elementary school in a 
lower class community; about one-sixth of the school population 
was Mexican. The purpose of the experiment was to determine 
what the effects would be of indicating to their teachers that a 
random sample of pupils would show a significant “spurt” in in- 
tellectual functioning, or “blooming,” within a year or so. At the 
beginning of the academic year all of the children in the school 
were pretested with Flanagan’s Test of General Ability to obtain 
total, verbal, and reasoning IQ's. Twenty per cent of the children 
were identified as “potential spurters”—actually at random but 
ostensibly by means of the test—to their teachers. The same IQ 
test was readministered to all of the children in posttesting sessions 
one semester, one academic year, and two academic years later. 
Intellectual growth was defined as the difference between a child's 
pretest IQ and his IQ on one of the posttests. The major research 
question was: Will the experimental group (expectancy group) 
show greater gains in IQ than a control group of the remaining 
children (all those not identified as “potential spurters”)? In gen- 
eral, the expectancy children did show greater gains in IQ than the 
controls, although there were significant interactions between gain 
scores and grade level, nationality, sex, ability track, and other 
variables. For example, children in grades one and two, Mexican 
children, and those in the medium ability track showed the greatest 
initial gains; boys showed greater gains in verbal IQ and girls in 
reasoning IQ. 
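
The gain-score comparison described above amounts to simple arithmetic, sketched here under loud assumptions: the IQ values are invented for illustration and are not data from the study, the function names are hypothetical, and no significance test is attempted. Each child's gain is posttest minus pretest, and the quantity of interest is the mean gain of the expectancy group minus the mean gain of the controls.

```python
from statistics import mean


def gain_scores(pretest, posttest):
    """Posttest minus pretest for each child, in matching order."""
    if len(pretest) != len(posttest):
        raise ValueError("pretest and posttest must list the same children")
    return [post - pre for pre, post in zip(pretest, posttest)]


def expectancy_advantage(exp_pre, exp_post, ctl_pre, ctl_post):
    """Mean gain of the expectancy group minus mean gain of the control group."""
    return mean(gain_scores(exp_pre, exp_post)) - mean(gain_scores(ctl_pre, ctl_post))


# Invented IQ scores, for illustration only.
expectancy_pre = [95, 100, 105]
expectancy_post = [107, 112, 117]
control_pre = [96, 100, 104, 108]
control_post = [104, 108, 112, 116]

advantage = expectancy_advantage(
    expectancy_pre, expectancy_post, control_pre, control_post
)
print(f"excess mean gain of expectancy group: {advantage} IQ points")
```

Because the difference of two imperfectly reliable scores carries the measurement error of both, a comparison of this kind is sensitive to a few extreme children, particularly when one group is small.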

The gains shown by the expectancy group were not limited to 
IQ. They also gained more than controls in reading grades and 
were rated by their teachers as more intellectually curious, happier, 
and in less need of social approval than controls. 

In spite of the rather dramatic changes shown by the expectancy 
children, the investigators were unable to identify any specific 
teacher behavior which might have caused such changes. The 
teachers did not appear to spend more time with the expectancy 
children and even failed to remember that many of them had 
been identified at the beginning of the year as “potential spurters.” 
In his studies of experimenter effects, Rosenthal has confessed 
that he does not know precisely what experimenter behavior pro- 
duces the effects, and the same admission can be made regarding 
the results of the present investigation. The authors speculate 
that the teacher treated the expectancy children differently from 
the controls—through facial expressions, postures, and perhaps 
touch, but they cite no evidence for this proposition. 

Rosenthal and Jacobson, like Fechner before them, have antici- 
pated their critics by considering alternative explanations for the 
results, viz., test unreliability, pretest IQ differences, artifacts in 
the testing process, and other methodological flaws; they discuss 
and dismiss each of these in turn. And it may well be, as Fechner 
declared of his psychophysics, that the results of Pygmalion in 
the Classroom will stand because critics will not agree on how to 
explain them. Certainly Rosenthal and Jacobson have produced a 
provocative experiment, and although critics should have no prob- 
lem attacking it, the array of significant findings makes it difficult 
to destroy. It should be noted that the control group samples were 
much larger than the experimental groups, that gain scores are 
open to question, that many of the significant differences may 
have been caused by the scores of only a few children, and that 
the results of the present experiment are in direct contrast to those 
obtained by Judy Evans (p. 96). Nevertheless, even as a minimum 
accomplishment, the results once more call into question the mean- 
ing and stability of test scores and other evaluations. They also 
point to a need for a reassessment and more careful analysis of 
the effects of teacher behavior, both verbal and non-verbal, and 
teacher attitudes on the attitudes, self-concepts, and performance 
of school children. 


Lewis R. AIKEN, JR. 
Guilford College 


Gilbert Sax. Empirical Foundations of Educational Research. 
Englewood Cliffs, N. J.: Prentice-Hall, 1968. Pp. xiii + 443. 


This is a book which contains the usual (for books on research 
methodology) potpourri of philosophy of science, measurement, 
statistics, experimental design, and technique for library research, 
and data collection and analysis. The brand of stew served up by 
Sax is a little thin, even if fed in a concentrated one-semester 
dose. However, it is possible for an instructor to use the excellent 
list of references accompanying every chapter as a menu from 
which to feed his students heartier intellectual fare. In fact, this 
was how I used the book in a summer course for teachers working 
on a Master of Education degree. 

Sax has organized his material into 13 chapters which, as he 
points out in the preface, proceed fairly logically from the selection 
of a topic to data analysis and presentation of results. Chapter 
One provides an introduction to the philosophy underlying the 
scientific enterprise. There is a good statement in this chapter 
on the use of models in scientific work. Chapter Two consists of a 
discussion of the role played by research in education and a brief 
history of educational research. In the third chapter the focus is 
on what to research. Here Sax rightly emphasizes the importance 
of the problem rather than the methodology, a distinction students 
often fail to make. An innovation (which may have the ultimate 
effect of quickly dating the book) is a section listing potential 
research problems in each of several areas. This may be helpful 
to students (and instructors) who have not developed their own 
ideas. Chapter Four contains a fairly pedestrian treatment of the 
review of the literature and how to write it. The fifth chapter is 
an important one on the research hypothesis. Here Sax defines the 
research hypothesis, distinguishes it from the null hypothesis, justi- 
fies it as an essential part of the method of science, and suggests 
how to write it. 

Chapters Six and Seven deal respectively with sampling theory 
and measurement. Simple random, stratified, systematic, and clus- 
ter sampling are discussed in Chapter Six as are the factors that 
bear on how large a sample should be. Chapter Seven defines the 
common types of reliability and validity coefficients, and discusses 
the factors affecting such coefficients. 

Chapter Eight provides a description of some distortions that 
often influence empirical observations and of several safeguards 
against such distortions. This chapter also contains a section on 
the critical incident technique, a presentation which includes a 
brief description of factor analysis. Techniques and devices for 
collecting data are described in chapters nine and ten. The former 
covers the interview and the questionnaire, the latter a larger 
selection of techniques including disguised and unstructured tech- 
niques (e.g., projective tests), disguised and structured techniques 
(e.g., information tests designed to measure attitudes), sociometry, 
the Q-sort, the semantic differential, and content analysis. Infor- 
mation about reliability and validity is provided in most instances. 

The descriptive approach to research is the topic of Chapter 
Eleven. Here Sax discusses the case study, and the sample survey. 
Under the heading “Correlational Studies” Sax differentiates re- 
lationship studies from prediction studies. Also in this chapter 
Sax considers developmental studies of the longitudinal and cross- 
sectional types, and cross-cultural investigations. Besides describ- 
ing these types of studies, Sax provides information on the special 
problems encountered in doing each. 

In contrast to the emphasis on description in Chapter Eleven, 
Chapter Twelve emphasizes experimental methodology. The ap- 
proach adopted is to discuss experimental design using the basic 
framework developed by Campbell and Stanley (1963). In addition 
Sax suggests how to write the procedure section of a thesis.

The final chapter deals with data processing. Material here 
varies considerably in significance and covers the broad range 
from helpful hints on how to plan the processing operation to a 
discussion of the types of forms that may be used for recording 
data. Also included is a brief description of several types of
Bm and an elementary introduction to FORTRAN programming.

This is a book which is made more useful to the instructor and 
student by the fact that every chapter contains a detailed summary,
a set of exercises, and, as was mentioned earlier, an excellent
annotated supplementary reading list. 

One disappointing feature of the book is its “thinness.” This 
would perhaps be expected from the fact that Sax covers such a 
broad range of topics in a relatively brief volume (it is almost 
300 pages shorter than Kerlinger's [1967] monumental work). 
It is impossible, for example, to do justice—even at the relatively 
unsophisticated level of a first year graduate student in education— 
to topics such as factor analysis in a single page of text or 
sociometry in four. But the thinness is also apparent in ways 
that should not be correlated with size. For example, the author 
claims to assume a knowledge of introductory statistics but no- 
where is there the intellectually satisfying development of sta- 
tistical concepts that such an assumption would support. Also, there 
is the failure in a few instances to define the assumptions under- 
lying the development of a concept. This occurs, for example, in 
the sections on the Kuder-Richardson formula 20 and the Spear- 
man-Brown formula. It also occurs in the discussion of simple 
random sampling (pp. 132-135) in which the example is of sampling 
from a finite population but the equation for the associated standard 
error of the mean is for sampling from an infinite population. 
This latter problem could, of course, have been avoided by a 
discussion of sampling with and without replacement. Finally, the 
book generates the impression of intellectual thinness because it 
includes a discussion of several peripheral problems, ones which 
might better be left for students to solve using common sense. A 
good example of what I am referring to is found in the chapter 
on reviewing the literature where a section on notetaking is found. 
Additional examples are found in the chapter on data processing, 
much of which would not apply within the constraints of a specific 
computer installation or research problem and which must, there- 
fore, be classed as unessential. 

But to focus attention on the foregoing limitations is perhaps 
to miss the more important problem of defining the role the book 
can play in graduate education. Sax himself asserts that the 
purpose of his text is to familiarize students with the philosophy 
and assumptions underlying empirical research, the contributions 
of empiricism to education, the potentials and limitations of the 
empirical approach and some of the techniques used by educational 
empiricists (see p. viii of the preface). This is modest enough. 
However, there seems to be much more to his purpose that Sax 
was not prepared to state. The earlier summarization of content 
suggests that the book is expected to prepare students to do re- 
search. At this point it must be asked whether this or any other text 
on research methods can really provide a basis for training people 
to do research? I think not! As Hebb (1966) has argued, training in 
how to do research is obtained by apprenticeship. From this point
of view, textbooks and graduate courses have validity only as train- 
ing for research. Because of its limitations, enumerated earlier, 
Empirical Foundations of Educational Research is unlikely, either 
by itself or as the basis of a course, to provide the kind of 
knowledge that will shorten a student's research apprenticeship 
or significantly improve the quality of work done during an appren- 
ticeship. Instead, the book seems best suited to providing terminal 
Master's degree students with an overview of the philosophy and 
methods of empirical research in the (probably vain) hope that
they will thereby become competent to read the literature on 
educational research with some degree of understanding and critical 
discrimination. 
INTRA-INDIVIDUAL TEMPORAL VARIABILITY
AND PREDICTABILITY


RALPH F. BERDIE 


Office of the Dean of Students 
University of Minnesota 


ALTHOUGH the assumption that behavior is predictable does not
require that behavior be consistent, consistency of behavior may 
facilitate its predictability. If some person’s behaviors are more con- 
sistent than those of other persons, they may be more predictable. 
If one person leaves his apartment for work every morning exactly 
at eight o'clock, and another person's time of departure varies widely 
between seven-thirty and eight-thirty, then predictions regarding 
when the men would return from lunch, based on the time they left 
for lunch, presumedly might be more accurate for the former than 
the latter individual. Persons whose habits are regular, or whose be- 
havior is consistent over time, may be the most predictable. An 
informative review of the research on predictability has been pre- 
sented by Tolbert (1966). Fiske and Maddi (1961) discuss in 
detail the concepts of variability. 

Only suggestive evidence pertains to the relationships between 
consistency and predictability and two studies provide some in- 
sight. This author (Berdie, 1961) reported that the correlations 
between college aptitude test score and first quarter college grade 
point average were .70 for a group of students with little variability
shown on responses to a mathematics test and .44 for a matched
group with high variability. Most of the analyses in that study
provided results that failed to attain statistical significance, but the
results were consistent and the conclusion was that intra-individual 
differences in variability were perhaps related to predictability. A 


236 EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 


later duplication of this study with somewhat similar methods provided
no confirmation.

A more recent unpublished study by Arvey showed that 200 
college students could be divided into two groups on the basis of 
the variability of their high school grades. The correlation between 
predicted college grade point average based on test scores and high
school rank, and obtained grade point average for the most variable
group was .10, for the least variable group .69. Similar correlations
based on an independent cross-validation sample were
.23 and .74. Individuals whose behavior was more variable in high
school were less predictable in college and intra-individual variability
was regarded as a practical moderator variable (Saunders,
1956).

The use of a moderator variable in prediction depends on the 
purposes of prediction and the situation. Sex often is a moderator 
variable and academic predictions for girls tend to be more accurate
than for boys (Seashore, 1962). A multiple correlation between 
first year grades and three predictors was .46 for men and .57 for 
women in the University of Minnesota Arts College. For the total 
group, R = .52. Although the moderator variable could provide 
better predictive efficiency for one group, it really did not contribute 
to selection for the total group. For predictive purposes in advising 
and counseling, however, it did affect probability statements, and it
did provide incentive for further analysis of the prediction process. 

If the classification of persons on the basis of their consistency 
of performance is to aid in improving predictions for sub-groups 
of these persons, then the question of the generalizability of such 
consistency becomes important. To what extent can consistency of 
behavior be regarded as a trait which characterizes the individual? 

Using seven intra-individual variability indices based on six 
tests, the author (Berdie, 1969) found that the intercorrelations 
between the indices ranged from -.18 to .49. Of the fifteen correlations
between variance indices of the six tests, five attained
statistical significance beyond the .05 level and the variance indices 
for five of the tests were significantly correlated with the variance 
index of at least one of the other tests. For one test the variance 
index was not correlated with the index of any of the other tests. 
On these six repeated psychological measurements, the evidence
suggested that consistency of behavior is not highly specific to each
task; also that currently one cannot conclude that a highly
generalized consistency of behavior is observable over several tasks.
[...] observed that the correlations between variance estimates
of high school grades in six subject matter areas, including a total
[...], ranged from .04 to .34. Of the fifteen intercorrelations,
[...] the part-whole correlations, nine were statistically significant
beyond the .05 level. These results suggested that persons
whose grades are variable in high school in one subject tend to
have variable grades in other subjects but the relationships are
[...] reflect no easily observed generalized variability trait.
Before one can study the usefulness of indices of intra-individual
variability as moderator variables to improve prediction, the behavior
or behaviors on which the variability indices are to be
based must be selected. The present research was concerned first,
with the identification of appropriate variability indices, and
secondly, with the relationships between these indices and predictability
of academic behavior. The first concern is discussed
elsewhere (Berdie, 1969). The purpose of this report is to describe
the methods explored in studying relationships between variability
and predictability and present the evidence revealed.


Method 


The subjects were Institute of Technology freshmen entering the
University of Minnesota in the Fall of 1966. All freshmen were told
of the possibility of participating in the experiment and subjects
were selected from volunteers on the basis of class schedule, proximity
to campus, and availability of data. The subjects tended
to be somewhat superior academically to the total entering class
because the experiment was conducted during the second academic
quarter and included only students who survived the first quarter.
The students were a representative group of bright college students
motivated to earn forty dollars by participating in an experiment
that would cause them no stress or discomfort.


Providing a Basis for Inferring Intra-individual Variability 
Subjects were classified in terms of intra-individual variability
on the basis of their performance on six of the brief, highly speeded,
Repetitive Psychometric Measures developed by Moran and Mefferd
(1959). These are relatively pure factorial tasks frequently
described in the factor analytic literature. 

The Aiming test (A) consists of fifteen rows each containing 
twenty circles connected sequentially by a line. The subject places 
the test on a piece of corrugated paper and punches holes inside 
as many circles as possible without touching the circle. Subjects 
used a stylus consisting of a pencil-sized piece of wood with a thin 
metal point. The Flexibility of Closure test (FC) requires the sub- 
ject to copy thirty-six geometric figures into matrices of dots. The 
task, as described by the authors, is to retain the image of a specified 
configuration despite the influence of other distracting configura- 
tions in the perceptual field. The Number Facility test (NF) con- 
sists of ninety problems each requiring the addition of three two 
digit numbers. 

The Perceptual Speed test (PS) consists of rows of thirty digits 
with a digit in the left hand column of each row encircled and the 
task is to cross out every digit in the row similar to the encircled 
digit. The time limit specified by Moran and Mefferd for this test 
is two and one half minutes but early experience suggested that too 
many subjects completed the test with this time limit and the time 
limit was reduced to one and one half minutes.

The Speed of Closure test (SC) measures the ability to unify an 
apparently disparate perceptual field into a single percept. Each 
form consists of twenty-two lines and each one has letters in it 
apparently arranged at random but containing from two to four 
four-letter words which are to be encircled. The final test, Visual- 
ization (V) consists of tangled lines which must be followed visually 
from their start to their finish. 

For each of these tests Moran and Mefferd developed twenty 
different forms so that the forms would be equivalent. Later re- 
search indicated that for five of the six tests the alternate forms 
were reliably different (Moran, Kimble, and Mefferd, 1964) and 
correction factors were provided for the twenty alternate forms of 
the five tests. These correction factors were not used in this study 
in light of the experimental design employed. 

Moran and Mefferd reported reliability coefficients, based on
correlations between form one and form two, for the six tests rang- 
ing from .72 to .94. For each of the one hundred subjects included 
in the present research a score was derived from the odd numbered
forms of each test and from the even numbered forms of each test 
and the correlations between these scores ranged from .96 to .99. 
The tests appeared to be reasonably reliable. 

Moran and Mefferd reported that intercorrelations between the 
six tests, using only one form, ranged from .09 to .44. In the present 
experiment a total score was available for each subject on each 
test, based on the twenty forms of the test, and the intercorrelations 
between these scores ranged from .28 to .65, considerably higher 
than the intercorrelations reported by the original authors. Their 
correlations were based on scores derived from tests requiring 
from two to five minutes and involved measurements of considerably
less reliability than that characterizing the scores used here. This
raises an interesting question as to the extent to which intercor- 
relations found between tests and resulting factor analytic results 
depend on the length and related reliabilities of the tests included 
in the factor analyses. In spite of the higher intercorrelations 
found here, the tests appear to be sufficiently independent from one 
another to justify their use in this research. 


Procedures 


The one hundred subjects were divided into five groups and 
each group was tested at the same time each day for five days dur- 
ing four successive weeks. One group was tested at 9:30 a.m., 
another group at 12:30 p.m., and the three remaining groups were tested
at later times during the afternoon. Assignments to time periods
were based on students’ class schedules.

About one fifth of the subjects took form one of the test on the 
first day, form two on the second, etc. Another group of subjects took 
form five on the first day, form six on the second day, and on the 
twentieth day took form four. One group of subjects started on 
form nine, another group on form thirteen, and the final group on 
form seventeen. Within each time session students were randomly 
assigned to sequence groups. 
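
The rotation of starting forms described above can be sketched in a few lines of Python. This is only an illustration, assuming that each sequence group works through the twenty forms in numerical order and wraps from form twenty back to form one; the function name `form_sequence` is hypothetical and is not part of the original procedure.

```python
def form_sequence(start_form, n_forms=20):
    """Order of forms over n_forms testing days, beginning at start_form.

    Assumption: forms are taken in numerical order and wrap around after
    the last form, as the example of the group starting on form five implies.
    """
    return [(start_form - 1 + day) % n_forms + 1 for day in range(n_forms)]


# Starting forms named in the text for the five sequence groups.
START_FORMS = [1, 5, 9, 13, 17]
sequences = {start: form_sequence(start) for start in START_FORMS}

# The group starting on form five takes form six on day two
# and form four on day twenty.
print(sequences[5][0], sequences[5][1], sequences[5][19])
```

Under this scheme every group meets each of the twenty forms exactly once, and on any given day the five groups are working on five different forms.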

The experimenter read to each group an introductory statement 
to explain the purposes of the research. Subjects were told the 
experiment was designed to provide comparisons between the 
psychological characteristics as measured by these tests of students
in technology and science with characteristics of other students. 
The tests were administered by a trained and experienced psychometrist
who read the test instructions at the first session, administered
the practice exercises provided by Moran and Mefferd, and
then administered the tests. Testing schedules for each group were 
arranged Monday through Friday for four successive weeks and 
subjects who missed sessions made them up during the fifth week. 
Over ninety-five percent of the tests were administered to the sub- 
jects at the time of day originally scheduled and of the two thou- 
sand subject-attendances, 1,924 occurred on the day scheduled. 

At the completion of the last form of the last test, each subject 
completed a questionnaire reporting his reaction to the tasks and
his perceptions of the purpose of the experiment. In response to an 
open-ended question, 18 per cent of the students reported that they 
thought that the purpose of the experiment was related to the con- 
sistency of behavior but later, in responding to a check list con- 
taining five items pertaining to the research purpose, 57 per cent 
checked the item, "The experiment was concerned with the con- 
sistency of my test behavior.” These figures suggested that a 
reasonably large proportion of subjects had some realization that 
the experiment was concerned with the consistency of behavior 
but little indicated that the subjects were strongly motivated to
perform consistently. They did report high motivation to perform
well.

The experiment was conducted in a well isolated sub-basement 
room with over-head lights and lamps arranged so that illumina- 
tion was adequate. Little distraction occurred. Subjects were seated 
in the center of the room in classroom chairs with arm tablets. 


Prediction Data 


In addition to the one hundred and twenty Repetitive Psychometric
Measures test scores, additional data were available for
subjects. Those included a score on the Minnesota Mathematics
test, a score on the Minnesota Scholastic Aptitude test, high school
percentile rank, obtained Fall quarter grade point average, and 
predicted Fall quarter grade point average derived from the mathe- 
matics test. From the last two indices a discrepancy score was 
obtained by subtracting the obtained grade point average from the 
predicted grade point average. 


The Minnesota Mathematics test was developed as a means for 
admitting students to the University of Minnesota Institute of
Technology and is a comprehensive examination covering high
school mathematics, with emphasis on algebra. Correlations between
this test and Fall quarter grade point average range about
.50 for groups of freshmen admitted to college on the basis of
information other than this test score. The Fall quarter grade 
point average was based on the grades of all courses taken during 
the first quarter in the Institute of Technology. Students customarily 
register for from three to four courses requiring a total of fifteen 
to twenty hours per week of class and laboratory attendance. For 
the one hundred subjects the correlation between the Minnesota 
Mathematics test score and the obtained Fall quarter grade point 
average was .33 and the results of the experiment must be inter- 
preted in light of the relatively low predictive efficiency of this 
instrument. 

For each student the predicted Fall quarter grade point average 
was obtained by using a single variable regression equation based 
on the mathematics test scores derived from a previous class of 
Institute of Technology freshmen. For the one hundred subjects 
the correlation between the mathematics test score and the pre- 
dicted Fall quarter grade point average was .84. The difference be- 
tween this and a correlation of one reflects possible change in the 
population and the extent to which error has been incorporated 
into the original regression equation. The reliability of the obtained 
Fall quarter grade point average can be inferred from the correla- 
tion between obtained Fall quarter grade point average and ob- 
tained Winter quarter grade point average, .70. 


Analysis 


The twelve thousand Repetitive Psychometric Measures tests
were scored by research assistants; scoring was checked; and, when
necessary, papers were re-scored. Test scores and other data then
were entered on basic record cards for each student, verified, and 
then punched and verified on IBM cards. 

For each of the six tests a variability index was computed for 
each student. This consisted of the variance (SD²) of the twenty
raw scores derived from the twenty forms of the tests. Then, in 
order to facilitate comparisons between tests and to provide a 
basis for obtaining a total variance index, each raw score was 
transformed to a standard score, using a mean equivalent to 50
and a standard deviation equivalent to 10, based on the distribu- 
tion of one hundred scores of each form of each test. To illustrate 
this, the mean and standard deviation were calculated for the 
one hundred scores on form one of the Aiming test and for each 
student, the raw score on form one was transformed to a standard 
score based on this distribution. Using the standard scores for each 
of the six tests variability indices were computed and a seventh 
variance index was calculated for each student, based on all one 
hundred and twenty standard scores. 
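
The two computations in this paragraph, the variance index and the conversion to standard scores, can be illustrated as follows. This is a minimal sketch rather than the original computing procedure: the function names and the sample data are hypothetical, and the use of N (rather than N - 1) as the divisor is an assumption, since the text does not say which was used.

```python
from statistics import mean, pstdev, pvariance


def variance_index(scores):
    """Variance (SD squared) of one student's scores across forms.

    Assumption: the divisor is N (population variance); the source does
    not state whether N or N - 1 was used.
    """
    return pvariance(scores)


def to_standard_scores(form_scores, new_mean=50.0, new_sd=10.0):
    """Rescale the scores of all students on ONE form to mean 50, SD 10."""
    m = mean(form_scores)
    sd = pstdev(form_scores)
    if sd == 0:
        raise ValueError("all scores on this form are identical")
    return [new_mean + new_sd * (x - m) / sd for x in form_scores]


# Hypothetical raw scores of five students on a single form.
form_one_raw = [42, 55, 61, 48, 54]
form_one_std = to_standard_scores(form_one_raw)

# Hypothetical standard scores of one student on four forms.
print(variance_index([48.0, 52.0, 50.0, 54.0]))
```

Because each form is standardized separately on the distribution of all subjects, differences in difficulty between forms are removed before a student's variance over forms is computed.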

The consistency of the variability indices was analyzed and an 
odd/even reliability coefficient obtained for each of the six tests 
and for the total variance index. The raw scores on each of the ten 
odd-numbered forms were used to obtain one variance index and 
the scores on each of the ten even-numbered forms were used to 
obtain a second index and the correlations obtained for the six 
tests were: A .83, FC .41, NF .80, PS .55, SC .57, V .25. The comparable
correlation for the variance index encompassing all 120
standard scores was .89. 
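
The odd/even analysis of the variance indices can be sketched as below. The sketch assumes that each student's scores are stored in form order, that variances use N as the divisor, and that no step-up correction is applied, since none is mentioned in the text; all names are hypothetical.

```python
from math import sqrt
from statistics import mean, pvariance


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def odd_even_reliability(score_rows):
    """Correlate variance indices from odd- and even-numbered forms.

    score_rows[i] holds student i's scores on forms 1, 2, 3, ... in order.
    """
    odd_var = [pvariance(row[0::2]) for row in score_rows]   # forms 1, 3, 5, ...
    even_var = [pvariance(row[1::2]) for row in score_rows]  # forms 2, 4, 6, ...
    return pearson_r(odd_var, even_var)
```

Applied to the scores on one test, the function returns a single coefficient of the kind reported above for each of the six tests.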

The variance indices based on five of these tests and on the total 
score were sufficiently reliable to suggest that these measures of 
variability themselves were consistent. 

Table 1 presents the intercorrelations between the six variance 
indices. Six of the fifteen correlations are statistically significant 
and the variance on five of the tests is related to the variance on at 
least one other test. The variances based on the six tests and the 
total variance, which provides a part-whole correlation, correlate 
between -.03 and .49.

These intercorrelations suggest that variability over time on 
some tasks is related to variability on other tasks, but these rela- 
tionships are no more than moderate, even when one considers 
the reliabilities of the observations. Intra-individual variability 
is not specific to each task and neither is there a strongly general- 
ized characteristic of variability that extends over a broad variety 
of tasks. 

These results suggest that at least two of the tasks studied,
Aiming and Number Facility, and the total variance index, may 
provide adequate measures of intra-individual variability. 

TABLE 1
Intercorrelations between Variables Using RPM Standard Scores
(N = 100)

The most comprehensive variance index, and the one with the
highest reliability, was the one based on all one hundred and twenty
standard scores. Although this measure of intra-individual vari- 
ability encompassing six different tasks is difficult to interpret 
because of its heterogeneity, it does provide one means for testing 
the hypothesis that intra-individual variability is related to pre- 
dictability of behavior. 

To test this hypothesis, the mean of each student’s one hundred 
and twenty standard scores and the variance of these scores were 
recorded and the one hundred subjects were divided into four
groups: those with both the means and the variances above the
group average, those with both the means and the variances below
the group average, those with the means below and the
variances above the group average, and those with the means above
and the variances below the average. Then groups were recombined
so that one group included all of those with mean indices above 
the group average, one with mean indices below the group average, 
one with variance indices above the group average, and one with 
variance indices below the group average. 
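
The grouping just described amounts to a two-way split followed by a recombination, which can be sketched as follows. The function and key names are hypothetical, and placing subjects who fall exactly at the group average in the "hi" group is an assumption made for the illustration.

```python
from statistics import mean


def classify_subjects(mean_index, variance_index):
    """Return eight lists of subject positions keyed by group name."""
    mean_cut = mean(mean_index)      # group average of the mean index
    var_cut = mean(variance_index)   # group average of the variance index
    groups = {
        "hi_mean_hi_var": [], "hi_mean_lo_var": [],
        "lo_mean_lo_var": [], "lo_mean_hi_var": [],
    }
    for i, (m, v) in enumerate(zip(mean_index, variance_index)):
        m_level = "hi" if m >= mean_cut else "lo"  # ties go to "hi" (assumption)
        v_level = "hi" if v >= var_cut else "lo"
        groups[f"{m_level}_mean_{v_level}_var"].append(i)
    # Recombine the four cells into the four marginal groups.
    groups["hi_mean"] = sorted(groups["hi_mean_hi_var"] + groups["hi_mean_lo_var"])
    groups["lo_mean"] = sorted(groups["lo_mean_lo_var"] + groups["lo_mean_hi_var"])
    groups["hi_var"] = sorted(groups["hi_mean_hi_var"] + groups["lo_mean_hi_var"])
    groups["lo_var"] = sorted(groups["hi_mean_lo_var"] + groups["lo_mean_lo_var"])
    return groups


# Four hypothetical subjects, one falling in each cell.
example = classify_subjects([60, 60, 40, 40], [9, 1, 1, 9])
print(example["hi_var"], example["lo_var"])
```

Each subject appears in one of the four cells and in two of the four recombined groups, so the eight groups are not independent of one another.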

For each of these eight groups statistics were calculated, including
the means for the mathematics test, the predicted Fall
quarter grade point average, the obtained Fall quarter grade point
average, and the discrepancy between these, and the correlations between
the mathematics test and obtained Fall quarter grade point average
and between predicted grade point average and obtained grade point
average.

Then for the total group of one hundred subjects correlations 
were calculated between all of the variables, including the correla- 
tions between the several variance indices and discrepancy be- 
tween obtained and predicted grade point average. 

Then, in light of the demonstrated interactions between the 
variance index and the mean index, the total group of one hundred 
was divided into four sub-groups: students with correctly predicted 
grade point averages, which were high, those with correctly predicted 
grade point averages which were low, those whose grade point 
averages were under predicted, and those whose grade point averages 
were over predicted. The variance indices on the six tests were 
compared for these groups (Hobert and Dunnette, 1967). 

Then the group was divided into four other sub-groups consisting
of subjects with both low mean scores and low variance
indices on the Aiming test, those with low mean scores and high
variance indices, those with high mean scores and low variance
indices, and those with high mean scores and high variance indices, 
and the grade point discrepancies for these groups were compared. 

Then the entire group was divided into two groups, one with 
above average variance indices on the Number Facility test and
one with below average variance indices on this test, and the groups 
were compared on the basis of correlations and means. 

In order to allow for possible effect of size of raw scores on the 
tests, another group of students was selected consisting of those who 
had mean scores on both the Aiming and Number Facility tests 
that placed them within plus and minus one standard deviation of 
the means for the total groups and for this group correlations were 
determined between variance indices and other characteristics, 
including grade point discrepancies. 

Next, for each student the coefficient of variation was calculated 
on the Number Facility test and comparisons were made of stu- 
dents with coefficients of variation above the median and those 
below the median. 
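
The coefficient of variation expresses a student's variability relative to the size of his own scores, which is one way of allowing for the effect of score level. A minimal sketch follows, assuming the usual definition (standard deviation divided by the mean) and a divisor of N for the standard deviation; the function names are hypothetical, and students falling exactly at the median are left out of both groups here.

```python
from statistics import mean, median, pstdev


def coefficient_of_variation(scores):
    """Standard deviation divided by the mean of one student's scores."""
    return pstdev(scores) / mean(scores)


def median_split(values):
    """Return (above, below): positions of values above and below the median."""
    cut = median(values)
    above = [i for i, v in enumerate(values) if v > cut]
    below = [i for i, v in enumerate(values) if v < cut]
    return above, below


# Hypothetical Number Facility raw scores for four students on three forms.
students = [[30, 34, 32], [50, 50, 51], [20, 28, 24], [40, 41, 39]]
cv = [coefficient_of_variation(s) for s in students]
above, below = median_split(cv)
```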

These various analyses were designed to provide information 
concerning the differences in predictability of students with dif- 
ferent variability characteristics. 


Results 


Consideration of Variance and Mean Indices 


Table 2 shows the characteristics of the one hundred subjects 
divided into eight sub-groups on the basis of individual subject 
mean and variance on total score derived from responses to all 
twenty forms of each of the six tests. The first group of twenty- 
seven subjects had a mean standard score for the one hundred and 
twenty tests and a variance of these scores placing them at or above 
the group mean average and group mean variance for the total 
group of one hundred subjects. The second group had mean scores 
above the group average but variance indices below the group 
average and the next two groups consist of thirty students with 
both means and variances below the group average and twenty- 
two with means below and variances above the group average. The 
fifth, the high mean group, consists of the first and second groups 
combined, the next group, the low mean group, of the third and 
fourth groups combined, the next “high variance” group of the first
and fourth groups combined, and the last “low variance” group of
the second and third groups combined.

TABLE 2

Characteristics of Subjects Divided into Groups on the Basis of Means on the Standard Score
Total Mean Index and Total Variance Index

Mean index                   hi     hi     lo     lo     hi     lo                Total
Variance index               hi     lo     lo     hi                   hi     lo

Minnesota Mathematics test
  mean                    34.00  32.00  33.87  31.23  33.13  32.75  32.76  33.10   32.9
  SD                       8.20   4.72   9.43   7.58   6.91   8.71   7.97   7.83   7.86
Predicted fall quarter grade point average
  mean                     2.98   2.22   2.21   2.10   2.25   2.16   2.20   2.21   2.21
  SD                        .36    .25    .38    .38    .32    .38    .38    .33      ?
Obtained fall quarter grade point average
  mean                     2.60   2.30   2.27   2.34   2.50   2.29   2.48   2.30   2.30
  SD                        .52    .54    .80    .74      ?    .77    .64      ?      ?
Grade point average discrepancy (algebraic)
  mean                        ?      ?      ?      ?      ?      ?    .28    .09      ?
  SD                          ?      ?      ?      ?      ?      ?      ?      ?      ?
r MMS & OFQGPA                ?      ?      ?      ?      ?      ?      ?      ?      ?
r PFQGPA & OFQGPA             ?      ?      ?      ?      ?      ?      ?      ?      ?

Note. A question mark indicates a value that is not legible in the source.

The hypothesis, “Subjects who were more variable are less pre- 
dictable" can be tested by looking at the correlations between 
predictors and criteria and also by looking at the discrepancies 
between predicted and obtained grade point averages. For the low 
variability group, the eighth group, the correlation between mathematics
test score and the obtained Fall quarter grade point average
was .39, as compared to the correlation of .28 for the high variability
group, group seven. The difference is in the expected direction
but is not statistically significant.

The mean discrepancy between predicted and obtained grade point 
average for the low variability group was .01 with a standard 
deviation of .63, as compared to the mean discrepancy of .28, with 
a standard deviation of .61, for the high variability group. The “t” 
for the mean difference was 2.19, showing that the difference was 
statistically significant between the probability levels of .05 and
.01. The discrepancy analysis supports the hypothesis that persons
who have low variability indices will perform more as predicted 
compared with persons with high variability. 

The two variability groups were quite similar on the basis of 
mean mathematics test scores and predicted grade point average. 
The high variability group obtained somewhat better grades on 
the average than did the low variability group but the difference 
was not statistically significant. 

The correlational comparisons of the four groups divided on the 
basis of both variance and means reveal for the groups with above 
average mean indices, that the low variability group is more predictable
than the high variability group. The correlation between
the mathematics test and the Fall quarter obtained grade point
average for the former group was .67, for the latter group .16.
Transformation of these coefficients to “z” and evaluation of the
significance of the difference provide a “t” of 2.08, significant with
a probability between .05 and .01. When the mean discrepancies
between predicted and obtained grade point averages are com- 
pared, .14 and .32, the differences are not statistically significant. 
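
The comparison of the two correlations can be reproduced approximately with Fisher's r-to-z transformation. The sketch below assumes the usual standard error for the difference between two independent transformed correlations, the square root of 1/(n1 - 3) + 1/(n2 - 3). The size of the high mean, low variance group is not stated directly; the value of 21 used here is inferred as 100 - 27 - 30 - 22 from the group sizes reported above and should be read as an assumption.

```python
from math import atanh, sqrt


def fisher_z_difference(r1, n1, r2, n2):
    """Critical ratio for the difference between two independent correlations.

    Each r is transformed to Fisher's z (the inverse hyperbolic tangent),
    and the difference is divided by its standard error.
    """
    z1, z2 = atanh(r1), atanh(r2)
    se = sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (z1 - z2) / se


# High mean groups: low variance (r = .67, n = 21, inferred)
# versus high variance (r = .16, n = 27).
stat = fisher_z_difference(0.67, 21, 0.16, 27)
print(round(stat, 2))  # 2.08
```

With these inputs the ratio agrees with the value of 2.08 reported in the text, which lends some support to the inferred group size.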

When the low and high variability groups with below average 
mean indices are compared, the correlations are almost identical, 
.35 and .36, and the discrepancies, .06 and .23, are not statistically
significantly different. Thus, the hypothesis has some support from 
this analysis but the results suggest that the hypothesized relation- 
ship between variability and predictability is observable mainly 
for subjects achieving high mean scores on the tasks from which 
the variability index is derived. 


Intercorrelations of all Variables 


Table 1 shows the means, standard deviations, and intercorrela- 
tions for all the variables, with the RPM indices based on standard 
scores. A similar table using raw scores revealed essentially the
same relationships. Obtained Fall quarter grade point average, 
the criterion used here, was substantially correlated with discrep- 
ancy between obtained and predicted grade point average and 
significantly correlated with mean scores on the Aiming, Number 
Facility, Perceptual Speed, Visualization, and total mean score on 
the RPM tests. Obtained Fall quarter grade point average was 
not correlated significantly with any of the variance indices.

The GPA discrepancy was correlated .19 with the variance
index on the Visualization test, but this was not statistically signif-
icant. The discrepancy was correlated with the mean score on three
of the six tests and with the total mean score. The absence of 
significant correlation between the discrepancy and the seven vari- 
ance indices does not lend support to the general hypothesis and in 
light of the results presented in Table 2, again suggests that the 
relationships, if present, are quite complex. 

The mean indices are significantly correlated with the variance 
indices on five of the six tests and this correlation for the Perceptual 
Speed test is moderate and negative. For the total index the cor- 
relation between variance and mean is .21, barely significant at the 
.05 level.


Consideration of Relationship between Predictor and Criterion 


The relationship between predicted and obtained grade point 
average and the seven mean and seven variance indices is analyzed 
in another way in Table 3. A scatter diagram was prepared show- 
ing the relationship between predicted and obtained grade point 
averages and the group mean predicted and group mean obtained 
averages were determined. The total group was divided into four 
sub-groups corresponding to the four quadrants. One quadrant con- 
tained thirty-one subjects with both predicted and obtained high 
grade point averages, another contained thirty-six subjects with 
both predicted and obtained low grade point averages. One quad-
rant contained eighteen subjects whose obtained grade
point average was high and predicted grade point average low,
and the fourth quadrant contained the fifteen subjects with high 
predicted grade point averages and low obtained grade point aver- 
ages. For each of these groups, Table 3 presents the mean and 
standard deviation for each of the seven variance indices and for 
each of the seven mean indices. The subjects with the most ac- 
curately made predictions are in the first two groups, the ones with 
the least accurate predictions in the last two groups. Examination 
of the average variance indices shows few consistent or meaningful 
trends. When one combines the sixty-seven subjects in the two 
groups most accurately predicted and the thirty-three subjects in the 
two groups least accurately predicted, the correct prediction group 
has the highest mean variance on the Aiming, Flexibility of Closure,
and Number Facility tests and the incorrect prediction group has 
the highest mean variance on the Perceptual Speed, Speed of 
Closure, and Visualization tests. The mean variances for the two 
groups on the total measure are practically identical. 

On the first three tests the mean variances for the low achievers 
who were correctly predicted are higher than those of the other 
three groups, but this is not found for the last three tests. When the 
two groups with low predicted grade point averages are compared, 
the group that achieved according to prediction and the group that 
did better than predicted, a difference in the mean variance on the 
Aiming test is statistically significant with a probability of less 
than .01 and the variances also are significantly different for the 
variance index on this test but this does not appear to fit in with 
any of the other results. Perhaps if any differences can be ob-
served in this table, they refer more to differences between high
and low achievers than to differences between correct and incorrect
predictions.


Consideration of Mean and Variance Indices on Aiming


Table 4 presents the characteristics of four groups divided on
the basis of variance index and mean index on the Aiming test,
using raw scores. Thirty subjects had both mean and variance in-
dices below the group mean, eighteen had mean indices below and
variance indices above the group means, etc. When the two groups
of students with low mean indices are compared, those with low
variance and those with high variance indices, the difference in


TABLE 4


Comparison of Low Scoring-Low Variance, Low Scoring-High Variance, High Scoring-Low
Variance, and High Scoring-High Variance Groups, Using the Aiming Test of the RPM*,
on Predicted and Obtained Fall Quarter Grade Point Average and Discrepancy between
Predicted and Obtained GPA


                                          Predicted       Obtained        GPA
                                          Fall GPA        Fall GPA        discrepancy
Group                                N    Mean    SD      Mean    SD      Mean
Low mean low variance quadrant       30   2.2     .35     2.40    .60      .15
Low mean high variance quadrant      18   2.20    .34     1.90    .68     -.30
High mean low variance quadrant      18   2.35    .38     2.68    .56      .30
High mean high variance quadrant     34   2.08    .31     2.50    .59      .4


* Data in chart are based on raw scores of Aiming test. Corresponding table based on standard s[...]
TABLE 5


Correlations between Predictors and Fall Quarter Grade Point Average for Groups with
High and Low Variances on the RPM Number Facility Test


                                                        Above      Below
                                                        median     median
N*                                                      47         46
Correlation between Math test and Fall quarter GPA      .34        .32
Correlation between HSR and Fall quarter GPA            .46        .48
Correlation between MSAT and Fall quarter GPA           .21        .15
Correlation between MSAT and HSR                        .3[illegible]  .26
Correlation between Math test and HSR                   .26        .07
Math test mean                                          31.62      33.07
          SD                                            5.88       8.53
HSR mean                                                85.91      89.41
    SD                                                  10.80      7.59
MSAT mean                                               54.43      55.15
     SD                                                 11.09      12.02
GPA mean                                                2.41       2.36
    SD                                                  .67        .70
Discrepancy between pre-
dicted and obtained GPA mean                            .23        .14
                        SD                              .64        .63


* HSR and MSAT available for only 93 of 100 subjects.


the mean discrepancies is statistically significant beyond the .05
level with the high variance group having the greatest discrepancy.
When the two groups with high mean indices are compared, again 
the high variance group has the greatest discrepancy, but the dif- 
ference is not statistically significant. In Table 2, the statistically 
significant difference between the high and low variability groups 
was found for the groups with mean indices above the group 
average; with the Aiming test, the significant difference was found
for the low mean index group. Again some, but far from conclusive, 
evidence is available concerning the relationship between variance 
and predictability. 


Variability on Number Facility 

Table 5 shows characteristics of the subjects divided on the basis 
of variance index on the Number Facility test. Of the ninety-three 
subjects for whom complete data were available for this analysis, 
forty-seven were in the high variability group and forty-six in the
low variability group. For these two groups correlations between
the predictor test and obtained grade point average were almost 
TABLE 6


Relationships between Variance and Predictability for
56 Subjects in Middle Ability Range*


Correlation between discrepancy between predicted and obtained
  Fall quarter GPA and variance on Aiming test                      r = -.02
Correlation between discrepancy between predicted and obtained
  Fall quarter GPA and variance on Number Facility test             r = .02
Correlation between above two variances                             r = -.03
Correlation between predicted and obtained Fall quarter GPA         r = .5[illegible]
Correlation between predicted and obtained GPA for 28 subjects
  with lowest variance on Aiming test                               r = .58
Correlation between predicted and obtained GPA for 28 subjects
  with highest variance on Aiming test                              r = .40
Correlation between predicted and obtained GPA for 28 subjects
  with lowest variance on Number Facility test                      r = .58
Correlation between predicted and obtained GPA for 28 subjects
  with highest variance on Number Facility test                     r = .46


* These subjects had mean scores on both Aiming and Number Facility tests that placed them
within 1 SD of the means for total group.


identical. The discrepancy between obtained and predicted grade
point average was slightly greater for the variable group than for 
the less variable group, but the difference was not significant. 


Middle Range Subjects 


Table 6 shows the characteristics of students who on both the 
Aiming and Number Facility tests had mean scores that placed 
them within plus and minus one standard deviation for the total 
group mean. Again no significant correlation appeared between 
the predicted and obtained grade point average and the variance 
indices of these two tests. When the twenty-eight subjects with
lowest variances on the Aiming test were compared with the twenty- 
eight subjects with highest variance, the low variability group had 
the highest correlations between predicted and obtained grade point 
average, a correlation of .58 as compared to a correlation of .40 
for the other group. When the groups were divided on the basis of 
variability on the Number Facility test, again the low variance 
group had the highest correlation, .58, as compared to the correla- 
tion of .46 for the other group. With groups as small as these, 
differences between correlations to be significant would have to be 
much larger than one could reasonably expect to obtain so the
results are only mildly suggestive.


Relative Variability 


Finally, in order to study the possible effect of the relationships 
between the mean and variance indices themselves, the use of the 
coefficient of variation was explored and the results are presented 
in Table 7. 
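
The coefficient of variation and the median split used for Table 7 can be illustrated as follows. This is a sketch under stated assumptions: the trial scores are invented, the subject labels are hypothetical, and the population standard deviation is assumed because the text does not say which form was used.

```python
import statistics

def coefficient_of_variation(scores):
    """Standard deviation of a subject's repeated scores divided by their mean.
    ASSUMPTION: the population SD is used."""
    return statistics.pstdev(scores) / statistics.mean(scores)

def median_split(cv_by_subject):
    """Divide subjects into those at or below and those above the median
    coefficient of variation."""
    cutoff = statistics.median(cv_by_subject.values())
    low = [s for s, cv in cv_by_subject.items() if cv <= cutoff]
    high = [s for s, cv in cv_by_subject.items() if cv > cutoff]
    return low, high

# Hypothetical repeated-trial scores for four subjects.
trials = {
    "A": [10, 11, 9, 10],
    "B": [4, 16, 10, 10],
    "C": [20, 22, 18, 20],
    "D": [5, 15, 5, 15],
}
cvs = {subject: coefficient_of_variation(x) for subject, x in trials.items()}
low_group, high_group = median_split(cvs)
print(low_group, high_group)
```

Subjects A and C have different means but the same relative variability, which is the property that distinguishes this index from the raw variance index.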

The ninety-three students for whom complete data were available 
were divided on the basis of high and low coefficients of variation 
on the Number Facility test. For the variable group the correlation 
between predicted and obtained grade point average was .26, for 
the less variable group .55. The correlation between the Minnesota 
Scholastic Aptitude test and grade point average for the variable 
group was .10, for the less variable group .35. The correlation be- 
tween the Minnesota Scholastic Aptitude test and high school per- 


TABLE 7


Comparisons of Subjects Divided into Two Groups, Those Above and Those Below the
Median Coefficient of Variation on the Number Facility Test


(N = 93ᵃ)
                                         Low coefficient of      High coefficient of
                                         variation group         variation group
                                         N = 48                  N = 47
Minnesota Math test                      33.65    7.88           31.04    6.53
Predicted Fall quarter GPA               2.26     [illegible]    2.4      .34
Obtained Fall quarter GPA                2.49     .58            2.28     .80
Difference between predicted
  and obtained                           .23      .45            .13      NS
High school rank                         88.89    8.88           80.43    9.94
MSAT                                     54.17    10.55          55.38    12.45

Correlated variables
Minn. Math test and Fall quarter GPA     .46                     .20
Predicted and obtained Fall
  quarter GPA                            .55                     .26
High school rank and Fall quarter GPA    .48                     [illegible]
MSAT and Fall quarter GPA                .35                     .10
MMT and Winter quarter GPA               .34                     [illegible]
High school rank and Winter GPA          .26                     [illegible]
MSAT and Winter quarter GPA              .13                     [illegible]
Fall quarter and Winter quarter GPA      .63                     [illegible]
MSAT and HSR                             .42                     .22

* P < .05.
ᵃ High School Rank and MSAT score available for only 93 subjects.
centile rank also differed although not significantly. The only
correlations that differed significantly were those between the 
Mathematics test and high school percentile rank, where the cor- 
relation was —.05 for the variable group, +.34 for the less variable 
group. Using this variability index the two groups also differed on 
the basis of mean Fall quarter grade point average, with the less 
variable group achieving better grades. 


Discussion 


Answering the question, “Is intra-individual variability related 
to predictability?” is not easy. The definition and measurement of 
intra-individual variability are complex. The answer to the ques- 
tion depends in part on the tasks used and the derived indices. 
The function being predicted makes a difference, and more than 
one method is available for observing the accuracy of prediction. 
On five of the six tasks used here, the subject’s level of performance 
was significantly related to his variance and one of these coefficients 
was negative. The answer to the question also depends in part on 
the segment of the range being observed of the performance level 
of the task providing the variance index. Some of the evidence 
suggests that the answer depends in part on the level being con- 
sidered of both the predictor variable and the predicted variable. 
The results do suggest that predictability of behavior, as defined 
here, is influenced to some extent by the subject’s intra-individual 
variability but that the relationship is not a straight-forward one. 

The total variance index based on one hundred and twenty
scores is most difficult to interpret insofar as it consists of twenty
scores on each of six relatively independent tests and these are all
transformed to standard scores. Using this complex index of intra-
individual variability, the correlation for the less variable group
between the predictor and the criterion is higher, but not statistically 
significantly so, than the correlation for the variable group. When 
the mean discrepancies between predicted and obtained grade point 
averages are compared for the high and low variability groups, 
however, the difference is statistically significant. In light of this
statistically significant difference and in light of the difference in 
the correlations being in the expected direction, the evidence does 


suggest that this complex variance index is related somewhat to 
predictability. 


Whereas the correlation between the predictor and predicted
variable appears to be relatively easy to interpret, the GPA dis- 
crepancy measure is not. Two methods are available for handling 
the discrepancy: it can be treated algebraically or arithmetically.
Treating it algebraically pays attention not only to the size of the 
discrepancy but also to its direction whereas treating it arithmeti- 
cally reflects only the size of the discrepancy and ignores its direc- 
tion. 
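
The distinction between the two treatments can be made concrete with a small sketch. The grade point values are invented for illustration, and the sign convention (obtained minus predicted) is an assumption, as the text does not state which direction was taken as positive.

```python
def discrepancies(predicted, obtained):
    """Return the algebraic (signed) and arithmetic (absolute) discrepancies.
    ASSUMPTION: discrepancy is taken as obtained minus predicted."""
    algebraic = [o - p for p, o in zip(predicted, obtained)]
    arithmetic = [abs(d) for d in algebraic]
    return algebraic, arithmetic

def mean(values):
    return sum(values) / len(values)

# Hypothetical predicted and obtained grade point averages for four subjects.
predicted = [2.0, 2.5, 3.0, 2.0]
obtained = [2.5, 2.0, 3.5, 1.5]

algebraic, arithmetic = discrepancies(predicted, obtained)
print(mean(algebraic), mean(arithmetic))
```

In this invented group the over- and under-achievers cancel, so the mean algebraic discrepancy is zero while the mean arithmetic discrepancy is half a grade point. Two groups can therefore differ on one measure and agree on the other.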

The analysis in Table 2 was done using both the algebraic and 
the absolute discrepancies. Whereas the difference between the two 
groups on the algebraic discrepancy was significant, the absolute 
discrepancies were the same for the groups. When the two groups 
with the high means were considered, the one with the high variance 
and the other with the low variance, the algebraic discrepancy was 
higher for the high variance group. For the two groups with low 
mean indices, the low variance group had the lowest algebraic 
discrepancy but the highest absolute discrepancy. 

If relationships between the algebraic discrepancy and the other 
variables were curvilinear a more precise picture of the relationship 
might be provided by using the absolute or arithmetic discrepancy. 
These relationships were examined for curvilinearity, and no evi- 
dence was observed suggesting other than rectilinear relationships. 
At the same time these observations were made, the relationships 
between mean and variance indices were observed and no evidence 
was found that these were curvilinear. 

The difference in correlations shown in Table 2 for the two groups 
with high means, as compared to the corresponding difference for 
the two groups with low means, is difficult to explain. For the two 
groups with high means, the group with high variance had a 
standard deviation of 8.20 on the Mathematics test and .52 on the 
criterion Grade Point Average. The low variance group had a much 
smaller standard deviation on the Mathematics test, 4.72, and 
about the same standard deviation on the GPA, .54. The interaction 
between the mean value and the variance value apparently has to 
be considered. 

If the correlations between GPA discrepancy and the seven 
variance indices in Table 1 had proved to be statistically significant, 
the most easily interpreted evidence concerning the hypothesis 
would be available. Obviously, the relationships were not that
simple.
The attempt to consider the interaction between the mean index
and the variance index by using the coefficient of variation pro- 
vided results that indicated, although often not significantly, that 
taking into account both of the measures, the less variable group 
was more predictable. The coefficient of variability is not a rigorous 
statistic but within the defined limits of this analysis, the results 
did suggest that the hypothesis might well be best tested taking 
into account measures corresponding both to the mean and variance 
indices. 


Conclusion 


The evidence suggests a relationship between the variability 
over time of a person's behavior on simple tasks and the effective- 
ness with which his academic performance can be predicted. The 
relationship between intra-individual variability and predictability 
is complex and the measurement of this relationship must consider 
the subject’s level of performance on the tasks from which the 
variability indices are derived, the level of performance on the 
predictor and predicted variable, and the method used in deter- 
mining accuracy of prediction. 

If intra-individual variability is an effective moderator variable, 
accuracy of prediction perhaps can be increased for the least vari- 
able group, but prediction for the variable group will remain in- 
efficient. If the efficiency of prediction is to be increased for the 
total group, and particularly for the group with greater intra- 
individual variability, then new prediction models may have to be 
devised and new predictors or new criteria studied. 

For example, for the group of persons with high variability,
highly speeded tests may be inappropriate as predictors. For this 
group a rectilinear regression model may be inappropriate and the 
nature of the shape of the relationship between predictors and 
criterion may be different. For the least variable group, a criterion
based on one quarter of academic performance may be appropriate 
whereas for the group with greater temporal variability, a criterion 
of academic success based on several semesters of work may be more 
appropriate. 

Temporal intra-individual variability may prove to be a useful 
and meaningful concept but before this can be determined further, 
more efficient means will be needed for observing such variability. 
The results here can be interpreted only as being suggestive. They
are promising.
CAPITALIZATION ON CHANCE IN ROTATION
OF FACTORS* 
Factor analysis is used for two different theoretical purposes:
to search for structure among correlated measures for purposes of 
theory construction, and to test hypotheses, derived perhaps from 
previous factor analyses, about measures. For purposes of the first 
sort a variety of computer programs for both oblique and orthog- 
onal rotation of factors is available. Visually guided rotations are 
also still used. For purposes of the second sort visually guided 
rotations are also used, but either orthogonal or oblique rotation
by computer to a target matrix is a typical procedure.

The question arises with respect to either of these theoretical 
purposes, and to either visually guided or computer rotations, as to
the extent to which the investigator can capitalize on chance. In
multiple regression, for example, the investigator capitalizes on
chance to an extent which is a function of the number of variables
(n) and the number of observations (N). The investigator can
“shrink” his multiple correlation, if he is interested in the popula-
tion value, or he can cross-validate, when he wishes to apply weights 
to independent samples. While an analogous phenomenon is expected 
in the factor analytic rotational process, the investigator has no 
statistical basis for estimating population values nor has he an 
empirical technique comparable to cross-validation at his disposal. 
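
The "shrinking" of a multiple correlation referred to above can be illustrated with the familiar adjusted R-squared correction. The text does not say which shrinkage formula is meant, so the formula and the example values below are assumptions chosen only to show how the correction depends on n and N.

```python
def shrunken_r_squared(r_squared, n_observations, n_predictors):
    """Adjusted (shrunken) squared multiple correlation.
    ASSUMPTION: the common correction 1 - (1 - R^2)(N - 1)/(N - n - 1)."""
    ratio = (n_observations - 1) / (n_observations - n_predictors - 1)
    return 1.0 - (1.0 - r_squared) * ratio

# Hypothetical case: an observed R^2 of .50 obtained with 12 predictors.
small_sample = shrunken_r_squared(0.50, 48, 12)
large_sample = shrunken_r_squared(0.50, 384, 12)
print(round(small_sample, 3), round(large_sample, 3))
```

With few observations per predictor the estimate of the population value drops substantially; with many observations it is nearly unchanged. No comparable correction is available for rotated factor loadings, which is the point of the passage.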

There is one empirical study which indicates the practical impor- 
tance of the problem. Horn (1967) factored the intercorrelations 
of random normal deviates by standard procedures, used the results
from the factoring of his psychological variables to form a target
matrix, and rotated the factors obtained from the random normal
deviates to an oblique best fit, in the least squares sense, to the
target matrix. By the standards of most factor analysts the fit was
a good one, but the parameters that produced the fit were not [...]
[...] The dependent variables to evaluate goodness of fit in [...]
of the research are the means of the correlations [...]
values of unity, the standard deviations of those same [...],
and the standard deviations of the correlations [...]
target values of zero. (The expected values of the means of [...]
correlations are zero, and departures from the expected [...]

[...] seeking, was generated as follows: A sub-set of [...]
factor matrices used for the hypothesis testing phase [...]
research were rotated by means of the Maxplane ([...]),
Oblimax (Pinzka and Saunders, 1954), Binormamin ([...]
and Dickman, 1959), and Varimax (Kaiser, 1958). Matrices
having n equals 48 were omitted to economize on [...]
Output from the three oblique programs is in the form [...]
pattern matrix so that the numerical values presented [...]
loadings [...]
The dependent variables in this phase of the research [...]
number and per cent of variables in the hyperplane, [...]
±.10, and the number and per cent of variables having [...]
greater than an arbitrary value of .20 [...]
not different from estimated communalities of many “real” vari-
ables. Percentage contribution to variance is arbitrary, since fixed 
values of the three parameters were selected, but within the limits 
of these fixed values N makes the biggest contribution to total 
variance. Sufficient common factor variance for rotational purposes 
is available. 

Hypothesis Testing Data. It is more dramatic to look at an 
example of the results before looking at them more analytically. 
For N equals 48, n equals 48, and m equals 24, the highest values 
of the “marker” variables are obtained. A representative portion 
of the table of reference vector correlations in the oblique solution 
is presented in Table 2. Obviously, a good simple structure has been 
obtained. Furthermore, for the orthogonal case the goodness of fit 
is only slightly poorer so that an additional example would add 
nothing. 

For the example in Table 2 the mean of the 48 correlations of 
the marker variables is .54 and the standard deviation is .10. 
Many factors have been defined by lower values of “real” variables 
in situations involving similar parameters. The hyperplane fit is 
also quite good with the mean being the expected value of .00, and 
the standard deviation .11. With about 65 per cent of the variables 
in the range between +.10 and —.10, the hyperplane count is quite 
good. It is also noteworthy that this variability is lower than the 
random variability of the original correlations. 
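
The hyperplane count described above can be sketched as a simple tally of entries falling within the band from -.10 to +.10. The small matrix below is invented for illustration and is not taken from Table 2; treating the band limits as inclusive is also an assumption.

```python
def hyperplane_count(matrix, width=0.10):
    """Count and percentage of entries whose absolute value does not exceed
    the hyperplane width. ASSUMPTION: the band limits are inclusive."""
    entries = [value for row in matrix for value in row]
    inside = sum(1 for value in entries if abs(value) <= width)
    return inside, 100.0 * inside / len(entries)

# Hypothetical 3 x 3 block of reference vector correlations.
example = [
    [0.59, 0.05, -0.02],
    [0.06, 0.41, 0.16],
    [-0.08, 0.43, 0.02],
]
count, per_cent = hyperplane_count(example)
print(count, round(per_cent, 1))
```

The same tally applied to a full matrix gives the per cent of variables in the hyperplane that is quoted for the example solution.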

Along more analytical lines, each of the dependent variables
was entered in a four-way analysis of variance. Since identical 
factor matrices were rotated by both oblique and orthogonal pro- 


TABLE 1


Means of the Median Communality Estimates
and Percentage Contributions to Variance for the Three Parameters


Variables*                    Means                       Per cent of Variance
Number of Variables           12       24       48
  Median h²                   .236     .320     .464       22.9
Number of Factors             1/4n     1/2n
  Median h²                   .243     .437                24.4
Number of Observations        48       96       384
  Median h²                   .495     .356     .169       46.3


* All variables have satisfactory levels of statistical significance based on pooled interactions
as estimates of error.
TABLE 2
Example of Reference Vector Structure Following
Oblique Rotation to an Arbitrary Target Matrix


I II II IV v VI XXIV 
1 59 0  -0 -02 09 09 20 
2 62 18 1 -0 -0 11 06 
3. 06 41 01 16 17 -15 -07 
n n 6 -—03  -10 u -12 01 
5& -08 —-01 43 02 07 -M 00 
6. 16  -01 60 12 20 04 —16 
T -0 09 09 58 0  -H 08 
& -02  -03 06 42 16 14 —19 
9 10 21 n 01 36  —02 02 
M.  —07 07 05 09 53 14 —06 
n 08  —16 05  -02 05 63 —08 
12. M -17 -18 05 17 42 08 
, 
à 
4. M4 -07 -07 -0 o2 -0 67 
ee o -0 o -o o 41 


grams, a mixed design was used. Thus rotational method makes a 
significant contribution to the variance of all dependent variables, 
but its percentage contribution to total variance tends to be small. 
N, n, and m also each make statistically significant contributions 
to total variance for each of the dependent variables. Interactions 
are small and, within the limits of variation of the parameters used 
in this study, can be disregarded. The means of the main effects 
and the several percentage contributions to total variance, which 
are more important than the size of the F-ratios, are presented in 
Table 3. 

The difference between having two and four marker variables
per factor makes the most substantial contribution to the variance
of the means of the marker variables. This difference controls 60
per cent of the total variance. The number of cases used in the
investigation controls more than half of the remainder. All inter- 
actions combined account for less than 7 per cent of total variance. 
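
The statement about the interactions can be checked by subtraction from the main-effect percentages for the means of the marker variables in Table 3. This is only an arithmetic sketch; the entry for type of rotation is printed as "6" in the scanned table and is read here as .6, which is an assumption.

```python
# Percentage contributions of the main effects to the variance of the
# means of the marker variables (Table 3).
# ASSUMPTION: the type-of-rotation entry is .6 per cent.
main_effects = {
    "number of variables": 8.8,
    "number of factors": 59.8,
    "number of observations": 24.3,
    "type of rotation": 0.6,
}

# Whatever the main effects do not account for is left to the interactions.
interaction_share = 100.0 - sum(main_effects.values())
print(round(interaction_share, 1))
```

Under that reading the remainder is consistent with the statement that all interactions combined account for less than 7 per cent of total variance.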

The means of the standard deviations of the marker variables
are more variable than the means of the means. Combined inter-
actions account for almost 25 per cent of the total variance, but
no one interaction stands out as potentially significant. It appears
instead that there is simply more error in the data. Of the several
TABLE 3


Means of Three Dependent Variables and Percentage Contribution to
Variance of the Three Parameters and Type of Rotation


Variable*                          Means                          Per cent of Variance

Number of Variables                12         24         48
  Means of Marker Variables        .284       .308       .365       8.8
  SDs of Marker Variables          .153       .166       .111      30.8
  SDs of Hyperplane Variables      .177       .157       .128      18.9
Number of Factors                  1/4n       1/2n
  Means of Marker Variables        .230       .408                 59.8
  SDs of Marker Variables          .159       .129                 12.4
  SDs of Hyperplane Variables      .173       .135                 16.4
Number of Observations             48         96         384
  Means of Marker Variables        .376       .340       .242      24.3
  SDs of Marker Variables          .172       .137       .123      23.6
  SDs of Hyperplane Variables      .193       .159       .110      54.9
Type of Rotation                   Orthogonal Oblique
  Means of Marker Variables        .311       .328                   .6
  SDs of Marker Variables          .156       .131                  8.8
  SDs of Hyperplane Variables      .161       .147                  2.4

* All variables have satisfactory levels of statistical significance based on pooled interactions
as estimates of error.



parameters, the number of variables controls 31 per cent of the 
variance and the number of cases is again moderately high with 
24 per cent. 

The data for the means of the standard deviations of the hyper- 
plane variables are again relatively error-free. The number of cases 
makes the largest contribution to total variance (55%) with num- 
ber of factors splitting most of the remainder. 

The contribution to variance of type of rotation was attenuated 
by starting both orthogonal and oblique with the same unrotated 
factor matrix, but one’s impression of the percentage contribution 
figures is that they are still surprisingly small. On a priori grounds
the opportunities for capitalizing upon chance seem much greater
when there is no control over the angles among the reference
vectors. Furthermore, these relatively small effects are not the
result of small departures from orthogonality in the oblique rota-
tions. The size and pattern of angles among the reference vectors
are such that the intercorrelations of the factors, which involve in-
version of the matrices of reference vector angles of separation,
are typically very high. This confirms a finding of Horn (1967);
but factors could be made to look more realistic with respect to 
their intercorrelations, with little loss in goodness of fit, if arbitrary
ceilings were placed on the angles of separation in the oblique
Procrustes program. Very high factor intercorrelations may in- 
dicate that rotations have been to some extent contrived, but low 
factor intercorrelations are no guarantee of objectivity as the pres- 
ent orthogonal case attests. 

Hypothesis Seeking Data. An example of a “good” rotational 
solution is presented again as an introduction to the main body of 
the results. Rotated factor loadings, this time following orthogonal
rotation, appear in Table 4. The 12 factors extracted from 24
variables were based on an N of 48. Each rotated factor is defined
by two or more variables with loadings greater than .30, and the
average number of variables loaded on a factor is three. There 
are 154 loadings, or about 54 per cent of the total within the hyper- 
plane as defined. There are few who would deny the realistic ap- 
pearance of the rotational solution, though an investigator might 
express a reservation based on the small number of observations. 


TABLE 4 


Varimax Rotated Factor Pattern for 24 Variables, 12 Factors,
and 48 Observations


l1 3. 8.4 -> B= SOO RUNE E 


4T 12 —25 18 09.08 284530 HD CO 
12 06 —04 07 —23 02 -35 06 
08 08 41 10 04 —44 —08 —27 
12 01 19 —o5 —07 83 08 09 —01 14 —07 —14 
21 —10 -08 31 20 -11 02 —03 
-12 01 —10 12 03 10 —06 04 
=18 —05 03. —08 —14 18 203,090 109 10109 Ia 08) 1007 
Ol -04 07 16 03 07 66 02 
Miia, 0401-719 00 Ose 04 
00 -07 09 01 —o5 —06 02 06 08 05 0l 
29 
04 
2 06 04 -17 09 24 08 16 
B 15 Qo 35 -0 —45 -03 -20 12 7 34 30 
15.558 —13 —00 ,.01 —03 4405 To 10 03 1i 
16 1 1.00 18 -02 0-1 -0 -06 1 
18 -06 10 -08 93 —09 —05 09 —19 uw 4 
r4. 18 1h ja 02, Alae ODE? 09 09 —08 —08 

-M of 77 —o7 -00 13 08 08 — -% 


eae 14 —06 
il 14 coo —32 —45 24 —06 —27 39 

2 1-0 s d» 1520 M i3 20 -31 
5 z -13 37 —21 —02 —06 —17 32 a po 25 —32 
«i Mes 02 -12 
Nevertheless, most factor analysts would interpret such factors
with confidence. 

The dependent variables selected for this phase of the research 
are the ones commonly used in factor analytic studies. Unfortu- 
nately, however, number of variables loaded on the factor and 
number of variables in the hyperplane are spuriously associated 
with the independent variables n and m. Furthermore, when the 
number of variables of both types is converted to a percentage, n 
and m are still, though differently, involved in the new dependent 
variables. Sole reliance should not be placed on the relationships 
involving either type; both have significance for the investigator 
who is trying to avoid substantial capitalization on chance in his 
investigation. 

The means of the main effects for each of the four dependent
variables are presented in Table 5. Contributions to variance are
also included even though the main effects are not all significant
in these analyses. The use of the pooled interactions as the
estimates of error does not appear to be as satisfactory as before.


TABLE 5


Means of Four Dependent Variables and Percentage Contribution to Variance of the
Parameters and Rotational Solution

(Only partly legible in the source. Legible means are given in printed order; ... marks an illegible entry. Column headings and percentage contributions could not be recovered.)

Number of Variables
    (all entries illegible)

Number of Factors
    Number of low loadings       45.25   ...
    Number of high loadings      14.29   ...
    Per cent of low loadings     47.40   54.50
    Per cent of high loadings    18.08   14.67

Number of Observations
    Number of low loadings       58.88   70.90   84.08
    Number of high loadings      25.25   20.69   ...
    Per cent of low loadings     42.69   49.15   61.00
    Per cent of high loadings    22.81   ...      8.31

Rotational Programs
    Number of low loadings       80.67   62.00   79.58   ...
    Number of high loadings      15.67   38.92   15.59   25.50
    Per cent of low loadings     55.76   44.30   ...     48.47
    Per cent of high loadings    14.44   15.80   ...     20.83

A few interaction terms stand out as potentially significant in all 
analyses. In particular the various rotational programs used may 
interact with both n and m in interesting ways that may deserve 
further investigation. There are also interactions involving com- 
binations of n, m, and N that make larger contributions to variance 
than certain of the main effects. These presumably arise from the 
nature of the dependent variables since all of the latter, in one way 
or another, have a spurious relationship with n and m. 

Among the rotational programs it is interesting that Maxplane, 
in a sense, belies its name. It is superior to the others in number 
and percentage of high loadings, but inferior with respect to hyper- 
plane count. Binormamin and Varimax provide almost identical 
results since the intercorrelations among Binormamin factors are 
almost zero. As a check on the comparability of factor patterns 
obtained with the four programs, indices of factor congruence were 
computed and distributed with the following results: Q3 = .98,
Q2 = .96, and Q1 = .88, with the Maxplane program producing
most of the tail of the distribution. Factor interpretations would
not differ appreciably as a function of the rotational solution. Thus,
invariance of rotational solution is no safeguard against capitalizing 
on chance. 
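
The text does not state which index of factor congruence was computed. As a minimal sketch, assuming Tucker's coefficient of congruence (the sum of cross-products of two columns of loadings divided by the square root of the product of their sums of squares), the comparison of one factor across two rotational solutions might look as follows. The loadings and variable names are made up.

```python
import math


def congruence(a, b):
    """Tucker's coefficient of congruence between two loading vectors."""
    if len(a) != len(b):
        raise ValueError("loading vectors must have the same length")
    cross = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("a loading vector is all zeros")
    return cross / norm


# Made-up loadings of four variables on "the same" factor in two solutions.
varimax_factor = [0.70, 0.55, 0.10, -0.05]
oblique_factor = [0.66, 0.60, 0.02, 0.04]
phi = congruence(varimax_factor, oblique_factor)
```

A high index computed on random data shows only that the programs agree with each other, which is the point made above: agreement among rotational solutions is no evidence that the factors are nonrandom.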

For number of both high and low loadings n makes the biggest 
contribution to total variance. N makes a substantial and signif- 
icant contribution only to number of high loadings, but the main 
effect means fall in the expected order for low loadings as well. 
Number of factors, on the other hand, makes a somewhat larger
contribution to number of low loadings than to high. 

For percentage of both high and low loadings, N makes by far 
the largest contribution to total variance while n and m have 
greatly reduced contributions. For per cent of low loadings the
contribution of n is not significant, while for per cent of high
loadings the contribution of m is not significant, but both sets of
nonsignificant differences among means are consistent with the
statistically significant results.

Discussion 
It is clear from these results that seemingly meaningful rotated
factors can be obtained from the intercorrelations of random normal
deviates whether the rotations are of the hypothesis testing or
hypothesis seeking variety. We must also ask, however, whether
this has any important meaning for the use of factor analysis with
“real” variables. The answer is an unequivocal “yes.”

The primary argument is that basic to all statistical reasoning. 
The statement that a difference could readily arise as a random 
fluctuation from a population value of zero does not mean that the 
obtained difference is truly random. In a parallel fashion, to state
that apparently well-defined factors can be obtained from the
intercorrelations of random normal deviates does not imply that
factors of similar size obtained from similar combinations of N, n,
and m are random or that the intercorrelations are random. Rather
it suggests that more and better data are needed before the claim
can be made with any reasonable degree of confidence that the
factors are nonrandom. 

Empirically there are a great many factor analytic investiga- 
tions reported in which many of the variables have distributions of 
correlations that do not differ markedly from random distributions. 
This is particularly true of investigations in the personality, in- 
terest, motivation, and temperament domains; but the widespread 
use of short tests of low reliability in all areas results in quite small
values of correlation coefficients, even though among abilities these 
correlations are predominantly positive. 

In some investigations the intercorrelations of the variables of 
major interest are clearly nonrandom but the inclusion of “hyperplane
stuff” to help define factors in the rotational process has the
effect of including near-random variables with the nonrandom.

Finally, in any investigation there comes a time in the process
of extracting factors when the residuals become random or
near-random and the resulting factors approach the same status as
the random factors in the present investigation. Unfortunately, the
criteria for retention of factors for rotational purposes are uncertain
in application; wise men differ radically in the decisions they make
concerning the number of factors. As a result there may be many
random factors in the literature from investigations that started
with reliable variables of substantial communality and with
intercorrelations based upon substantial Ns.

Are there ways of distinguishing between error and nonerror
factors that are independent of size of loadings and number of
variables loading on the factor? Cattell (1958) has argued that
there are distinguishable differences, but presents no empirical 
support. Horn (1967), following a suggestion made by the present 
senior author, found very high intercorrelations among his random 
factors, but as discussed above restriction to orthogonal rotations 
is very little safeguard. Replicability, which is the mainstay of the 
scientific method, is hopeless in factor analytic studies unless hedged 
about with more controls than is commonly the case. It is clear that 
with appropriate values of N, n, and m the Procrustes method, either 
oblique or orthogonal, could replicate random factors endlessly. 

We are finally left with psychological meaning or interpretability 
of the factors. This is, however, a very slender reed indeed since
psychologists share the all-too-human trait of being able to ration- 
alize almost anything after the fact. 

The present study does provide guidance for the investigator who 
wishes to use factor analysis as objectively as possible. Number 
of marker variables per factor is clearly a very important element 
in the design of a study. Four markers are almost minimal within 
the limits of N available to most investigators, particularly when 
the investigator is working with variables of low communality. 
Note that this recommendation assumes that the various markers 
are independent of each other so that the intent of this recom- 
mendation is to limit the number of factors to one-quarter of the 
number of variables. 

No minimum N can be specified. In general, N should be as large 
as feasible. Although no systematic study has been made of the
issue, present standards for N, as revealed in published factor
analytic studies, are probably too small. Definition of factors
depends upon stable differences among correlation coefficients as well
as having correlations that are significantly greater than zero. 
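
A rough calculation illustrates why N matters. When the population correlation is zero, the standard error of a sample correlation is approximately 1/sqrt(N - 1); this is a standard large-sample approximation, not a formula taken from this paper. The two values of N below are ones mentioned in the text, and the function name is hypothetical.

```python
import math


def null_se_of_r(n_obs):
    """Approximate standard error of r when the population value is zero."""
    if n_obs < 3:
        raise ValueError("need at least 3 observations")
    return 1.0 / math.sqrt(n_obs - 1)


for n_obs in (48, 384):
    se = null_se_of_r(n_obs)
    # About 95 per cent of chance correlations fall within +/- 1.96 SE.
    print(n_obs, round(se, 3), round(1.96 * se, 3))
```

On this approximation, chance correlations as large as about .29 in absolute value are unremarkable with 48 observations, whereas with 384 observations the corresponding figure is about .10. Differences among correlations of the former size cannot be regarded as stable.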

Finally, n has significance even if adequate attention has been 
paid to the first two parameters. The investigator should not add
variables just out of curiosity. The smallest number of variables
compatible with the purposes of the investigation should be the goal.
If a large number of variables does seem to be required, the
investigator can compensate for this handicap by paying special
attention to number of factors and number of cases. In the hypothesis
testing study, even with 384 cases and four markers per factor, it
was possible to obtain mean reference vector correlations for the
marker variables of .21 with a standard deviation of .10 when the
number
of variables was 48. Most of these factors would be interpreted with 
varying degrees of confidence by factor analysts. 

It is highly useful in the hypothesis testing study that the
parameters combine in an essentially linear manner. Thus an
investigator can estimate the level of any dependent variable for any
combination of the three parameters from knowledge of the main
effects. Interpolation and extrapolation within reasonable limits
could also be done with considerable confidence. “Trade-off” among
N, n, and m is a reasonable procedure. If for some reason n and m 
are fixed at dangerous levels, N must be drastically increased. Or, 
if N is limited, n and m must be carefully selected to compensate for 
that limitation. 
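
The additive estimate described above can be sketched briefly. If the parameters combine linearly, the expected level of a dependent variable for a given combination of N, n, and m is the grand mean plus the deviation of each main-effect mean from the grand mean. All numbers below are hypothetical and are not taken from the tables of this study; the function name is likewise an assumption.

```python
def additive_estimate(grand_mean, *level_means):
    """Estimate a cell mean from main effects alone (no interactions)."""
    return grand_mean + sum(m - grand_mean for m in level_means)


# Hypothetical main-effect means for one dependent variable.
grand = 0.30
mean_for_N = 0.22   # mean at the chosen number of observations
mean_for_n = 0.35   # mean at the chosen number of variables
mean_for_m = 0.33   # mean at the chosen number of factors
cell = additive_estimate(grand, mean_for_N, mean_for_n, mean_for_m)
```

In this made-up case the unfavorable levels of n and m raise the estimate by .05 and .03, and the favorable level of N lowers it by .08, so the cell estimate returns to the grand mean. This is the kind of trade-off the paragraph describes.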

The parameters seemingly do not combine in a linear manner in
the hypothesis seeking analogue, in which the dependent variables
are number or per cent of variables having “significant” loadings
and number or per cent of variables in the hyperplane. Estimation
of the level of these dependent variables for any combination of
N, n, and m is made with less confidence, and interpolation and
extrapolation are more hazardous. Nevertheless, the results serve
as a rough guide for the design of hypothesis seeking factor analytic
investigations.

It should also be acknowledged that the investigator can
compensate in part for undesirable combinations of N, n, and m
through a parameter not included in the present study. This
parameter is the communality of the measures used, for which high
reliability is a necessary but not sufficient condition. Beyond high
communality, although not easily quantifiable or even recognizable,
is an even more basic condition: the fit of the factor model to the
empirical data (Humphreys, 1968; Linn, 1968).


Summary 


In order to estimate in a systematic way the extent to which an 
investigator can capitalize on chance in factor analytic research, 
nine correlational matrices were generated from distributions of 
random normal deviates for the various combinations of three 
levels each of number of observations and number of variables. 
By introducing two levels of number of factors, 18 separate factor
matrices were obtained. These factor matrices were rotated by
both oblique and orthogonal methods to arbitrary target values in
the hypothesis testing phase of the investigation, and by three 
oblique and one orthogonal program in the hypothesis seeking 
phase. Goodness of fit, in the former phase, was measured by three 
dependent variables and in the latter phase by four different de- 
pendent variables, selected to reflect the practices and thinking 
of investigators using the factor methods. 

In this fixed variable design N, n, and m all make substantial 
contributions to the extent to which capitalization on chance takes 
place in both the hypothesis testing and the hypothesis seeking 
phases. Good rotated structures as judged by current standards are 
obtained for combinations of the three parameters well within the 
bounds of current research design. The results can be used as a basis 
for judging whether an investigator has results well beyond chance 
expectations and, more importantly, for the design of more careful 
factor analytic studies.


A DISTINCTION BETWEEN JUDGMENTS OF
FREQUENCY AND OF DESIRABILITY AS
DETERMINANTS OF RESPONSE


DOUGLAS N. JACKSON 
University of Western Ontario 


AND 
SAMUEL MESSICK 
Educational Testing Service 


The research reported in this paper is directed at the solution of
two inter-related problems: (a) the identification of the
determinants of the largest factor of personality inventories such as
the MMPI; and (b) the evaluation of consistencies in judging item
properties as possible sources of valid data about characteristics
of judges. The first problem is approached by considering, in addi- 
tion to judged desirability, two alternative connotative properties 
of items, namely, the judged frequency of occurrence of the trait 
represented by the item and the judged frequency of endorsement 
of the item. The second problem, the use of judgmental consistencies 
as an assessment technique, was approached by evaluating the 
degree to which individuals judged items differentially under alter- 
native instructional sets. 

In a series of factor analytic studies of the MMPI, Jackson and
Messick (1961, 1962, 1967) found two large dimensions which they
interpreted in stylistic terms. These two factors implicated virtually
all MMPI scales and were not tied uniquely to any single content
area. Both of these factors displayed high stability across diverse
populations. One factor consistently separated true and false-keyed 
scales and was hence identified as acquiescence. The second factor 
was highly correlated with independently obtained desirability 
scale values. The interpretation of this factor therefore highlighted 
the role of item desirability as a determinant of response con- 
sistency. It is this interpretation that we would like to re-examine 
in this paper in the light of new data and further analyses. 

Others have shared with us an interest in re-examining the inter- 
pretation of the largest factor of the MMPI. Wiggins (1962), for 
example, has proposed the conceptions of “communality” and 
“hypercommunality” for this purpose. Wiggins defined scale com- 
munality as “the average proportion of subjects in a given norma- 
tive group who answered the items in the direction in which the 
scale is keyed.” Thus, a scale showing highly asymmetrical en- 
dorsement frequencies keyed in the direction of deviance, like 
the MMPI F scale, would be a good measure of noncommunality. 
A measure with asymmetrical endorsement frequencies keyed in 
the direction of nondeviance would be a good measure of hyper- 
communality. For Wiggins, an important stylistic dimension of 
response consistency is the degree to which a subject responds more 
deviantly or less deviantly than others. He has suggested that such
tendencies manifest themselves as a major response dimension on
the MMPI and has proposed that the factor identified as
desirability be reinterpreted as reflecting hypercommunality and
noncommunality. Jackson and Messick (1962), in an effort to evaluate
the alternative hypotheses of desirability and hypercommunality, 
compared the correlations between MMPI first-factor loadings and 
each of two scale indices—the average MMPI value and the average 
MMPI desirability scale value. Each of these indices correlated 
highly with the MMPI first factor. But since the correlation was 
higher between desirability scale values and the first factor, the 
desirability interpretation appeared to be better supported. One 
aim of the present study is to explore these relationships further, 
treating communality not only as a response property, but as a
judged connotative attribute of item content. 

One immediate precursor to the present analysis was a study by
Jackson and Singer (1967). These investigators, using a complete
factorial design, administered 20 items, comprising four scales, to
240 males and 240 females with instructions to judge the items under
five instructional sets: desirable in oneself, desirable in others,
what others find desirable, frequency, and harmfulness. Analyses of
variance consistently yielded significant differences in the mean
values of scales, of judgmental sets, and subjects. Equally
important, all second order interactions were significant. The magnitude
of the main effects due to judgmental sets, together with complexity 
indicated by the interactions, suggested to Jackson and Singer that 
interpretations of connotative properties of items based on only 
a single dimension, like desirability, might be insufficient. Rather 
than attempting to explain the first factor of the MMPI solely in 
terms of general desirability, or some other very broad dimension, 
it was suggested that a fruitful alternative strategy might be to 
partition such a broad dimension into relevant sub-categories, and 
to evaluate each of these in terms of their contribution to ex- 
plaining the variance of both the largest MMPI factor and of 
particular facets of personality scale content. 

It was in a similar spirit that the present study was undertaken. 
This study was designed to evaluate three inter-related problems. 
The first is the degree to which judges, individually and in the
aggregate, can achieve a consensus regarding conceptions of the
frequency or communality of personality items. The second involves
the extent to which scale values reflecting alternative facets of
judged item frequency will predict MMPI first factor loadings in
comparison with desirability scale values. The third question
concerns the degree to which individuals reliably distinguish between
conceptions of frequency and of desirability in their judgmental
behavior. 

Psychologists interested in personality assessment have for a 
long time differentiated between an individual’s actual behavior 
and what he reported about himself on a questionnaire. The lack of 
isomorphism between these two facets of behavior has been the 
subject of a good deal of inquiry, particularly in regard to the
nature of the systematic biases that may be operative. It was
hypothesized that this distinction between actual behavior and
self-report might fruitfully be studied in the context of a judge's
conception of the relative frequency of behavior. Can a judge
distinguish reliably between the frequency of occurrence of the
behavior represented by a personality item and the frequency with
which the item will be endorsed? If the answer to this question
proves to be
affirmative, then important new information might be derived about 
the dynamics of responding and the effects of connotative attributes 
of items upon systematic response biases. 


Method 


Instructional Sets for Judging Frequency 


Two sets of instructions were prepared for eliciting judgments of 
frequency of endorsement and of occurrence of a trait. The first set 
stated that the task of the subject was to estimate the frequency of 
a “true” response to each statement. They further stated that “It 
is to be emphasized that you are to judge the frequency with which 
people would admit to each characteristic in describing themselves 
on a questionnaire, and not the actual frequency of occurrence of 
the characteristic itself.” The second set of instructions, on the other 
hand, instructed the judge to estimate the frequency of occurrence 
in a large number of people of the trait or behavior described by 
each statement. They went on to state that “It is to be emphasized 
that you are to judge the actual frequency of occurrence of the 
characteristic and not the likelihood that people would admit to 
the characteristic on a questionnaire.” Each of these instructional 
sets was printed in a booklet together with the entire set of 566 
MMPI items. Except for the specific judgmental instructions, the 
same format was used here as was used previously in obtaining 
MMPI desirability scale values (Messick and Jackson, 1961).


Subjects and Procedure 


A sample of 111 male and female college students was obtained 
at Pennsylvania State University and divided randomly such that 
55 subjects judged the frequency of item endorsement and 56 
judged the frequency of occurrence of the trait. Subjects were 
assembled in a carefully supervised group session. Judgments were
obtained on a nine-point scale, ranging from “extremely infrequent”
to “extremely frequent.”


Desirability Judgment Data 


In addition to judgments of the frequency of endorsement and of
occurrence of MMPI items, data were available from an earlier
study of judgments of desirability. Messick and Jackson (1961)
obtained desirability judgments of all MMPI items from a sample
of 171 subjects, 83 males and 88 females, also from Pennsylvania 
State University. In addition to the already reported desirability 
scale values and dispersions, certain other information, such as 
the degree to which each individual judge approximated group mean 
scale values, could be obtained from these raw data. Instructions for 
the desirability judgments, similar to those used by Edwards 
(1957), are described by Messick and Jackson (1961). 


Results and Discussion 


Scale Values for Desirability, Frequency of Endorsement, and 
Frequency of Occurrence 

Judgments of frequency of endorsement and of frequency of
occurrence were separately scaled by the method of successive
intervals (Diederich, Messick, and Tucker, 1957; Messick and
Jackson, 1961) to yield a set of scale values and discriminal
dispersions for each set of instructions. In addition, the method
provides a set of category boundaries adjusted so as to normalize
simultaneously every distribution of judgments on the same base
line. Thus, the scale values within each instructional set can be
considered directly comparable on the same scale.

The frequency of endorsement scale values so obtained for each
item were assembled into scales and averaged separately for 40
MMPI scales, keyed separately for true and false-keyed items.
The identical procedure was followed for scale values derived from
judgments of frequency of occurrence. Thus, it was possible to
obtain two numerical indices for each MMPI scale based upon
two different types of judgments of frequency. These indices could
then be used as a criterion to which MMPI factor loadings could
be related. Relationships so obtained could in turn be compared
with those based upon desirability judgments and MMPI factor
loadings. In addition to evaluating the degree to which single
connotative properties of MMPI items and scales account for
factor loadings, it was possible also to appraise their effects in
combination.


Standardization of the Direction of Scoring 


Tellegen (1965) has called attention to the obvious fact that
correlations between properties of scales may change if the
direction of scoring of the scale is reversed. Tellegen recommended a
procedure for standardizing the reporting of such indices. In the 
case of desirability scale values, Tellegen recommended keying 
all scales in the desirable or in the undesirable direction, essentially 
“folding over” the scale values of desirability so that they would 
extend only in one direction. Therefore, in addition to scoring the 
MMPI desirability and frequency scale values and factor loadings 
in the usual way, we followed Tellegen’s recommendation and 
reflected low scale values. Thus, if a particular desirability or 
frequency scale value was less than 5.0, roughly the neutral point
of the scale, the scale was scored in the opposite direction. If, for 
example, it was a false-keyed scale, it was keyed true. Similarly, 
factor loadings and values were reflected in a consistent manner. It 
should be noted that this procedure, by restricting the range of 
both variables, usually reduces the magnitude of correlations. This 
is illustrated schematically in Figure 1. It should also be noted 
that while Tellegen’s recommended procedure provides a standard 
and unique method of scoring for a particular set of scale values, 
e.g., desirability, there is a lack of uniformity in scoring across 
different types of scale information. Thus, in the present study, 
the Tellegen procedure when applied to desirability values for
MMPI scales resulted in a set of reflected scores quite different
from the reflected scores based on frequency of occurrence scale
values.
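
The attenuation produced by reflection can be illustrated with synthetic data. The sketch below assumes that a scale value v below the neutral point of 5.0 is reflected to 10 - v and that the sign of the corresponding criterion value (standing in for a factor loading) is reversed at the same time. The data, the seed, and the function names are made up for illustration and are not the MMPI values.

```python
import math
import random


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fold(values, criterion, neutral=5.0):
    """Reflect every scale whose value lies below the neutral point."""
    out_v, out_c = [], []
    for v, c in zip(values, criterion):
        if v < neutral:
            v, c = 2 * neutral - v, -c  # rekey the scale; flip its sign
        out_v.append(v)
        out_c.append(c)
    return out_v, out_c


rng = random.Random(0)
# Synthetic scale values on a 1-to-9 scale and a noisy linear criterion.
values = [rng.uniform(1.0, 9.0) for _ in range(2000)]
criterion = [0.15 * (v - 5.0) + rng.gauss(0.0, 0.25) for v in values]

r_original = pearson(values, criterion)
r_folded = pearson(*fold(values, criterion))
```

Folding halves the range of the scale values while leaving the error variance untouched, so the correlation computed on the folded values is smaller than the one computed on the original values.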


Frequency and Desirability Interpretations of the MMPI First 
Factor 


Table 1 presents intercorrelations between each of the MMPI 
connotative properties on a set of 40 true and false-keyed MMPI 
scales, including all of the original clinical scales. The data upon
which these correlations are based are the average item scale values 
for judged desirability, judged frequency of endorsement, judged 
frequency of occurrence, and average value for each true and false- 
keyed MMPI scale. Also presented are the correlations between 
these average scale values and the vectors representing MMPI scales 
for the largest factor of the MMPI after rotation. These correlations 
are based on a sample of 194 hospitalized psychiatric patients
and upon a second sample of 334 college students (Jackson and
Messick, 1962, 1967). Below the diagonal are the correlations
based on the original scale values; above the diagonal are the
correlations based on the reflected values after standardization.
Although both sets of values are presented for those who may wish to
compare them, the decision was made to limit discussion to results
based on the Tellegen procedure. While this procedure generally and
arbitrarily attenuates these correlations, the values so computed do
provide a standard that will lead to somewhat greater comparability.

[Figure 1. Restriction of range resulting from application of Tellegen's rule. Panels: original distribution; distribution after Tellegen's rule applied.]


TABLE 1

Correlations between 40 MMPI Scale Properties and Loadings on the Largest Factor

                                        (1)    (2)    (3)    (4)    (5)    (6)
(1) Judged Desirability                 --     ...    .87    .85    .68    ...
(2) Judged Frequency of Occurrence      .55    --     .92    .89    .18    ...
(3) Judged Frequency of Endorsement     .92    .79    --     .94    .42    ...
(4) Actual Endorsement Frequency        ...    ...    ...    --     ...    ...
(5) First Factor Loading, Hospital      .92    .80    .76    .78    --     ...
(6) First Factor Loading, College       .90    ...    ...    ...    ...    --

Note. Correlations are based on average item scale values for MMPI scales. Those
below the diagonal are for scales scored in the usual way. Those above the
diagonal are based on application of the standardization procedure for direction
of scoring. Entries shown as ... are illegible in the source.


There are a number of notable characteristics of the data appear- 
ing on Table 1. First, the correlations in the lower right hand 
portion of the table between the first factor loadings for the hospital 
and college samples are worthy of comment. These extremely high 
values indicate that the set of criterion factor loadings that are 
here being predicted are highly stable over diverse populations 
and conditions of administration. It is thus possible to strive 
to account for a major portion of the variance, since almost all of 
the variance appears to be reliable. 

A question of some considerable theoretical interest was the 
degree to which judges were sensitive to the actual endorsement 
frequency of personality items. Are judges on the average able to 
identify the kind of item that respondents will be likely to endorse,
and the kind they will probably reject? The answer is that indeed
they can. The correlation between judgments of endorsement
frequency and the proportion of actual endorsements is .94. The fact
that this correlation is so high raised the possibility that
conceptions of frequency or infrequency may influence responses, either
individually or in combination with other connotative attributes.
It is clear also that judges do make a systematic distinction
between frequency of endorsement and of occurrence. While these
two sets of judgments inevitably are substantially correlated, the
difference between them reflects a recognition that endorsements
reflect a bias in the desirable direction. Thus, judgments of
frequency of occurrence are maximally distinct from those of
desirability, with judgments of frequency of endorsement falling in
between. Actual endorsement frequency is most similar to judged
endorsement frequency in its pattern of correlations, also reflecting
some sort of compromise between desirability and frequency of
occurrence.

Turning next to a consideration of the degree to which the vari- 
ous judgmental sets predict MMPI first factor loadings, it is useful 
to examine the degree to which the application of the Tellegen 
procedure attenuates the correlations. For example, the relationship 
between judged desirability and the first factor in the hospital 
sample drops from .92 to .68. The relative proportion of the variance 
accounted for remains consistent, however, across the two analyses. 
Desirability accounts for the larger portion of the criterion first 
factor loading variance, with judgments of frequency of
endorsement and average actual endorsement frequency values each
accounting for a substantial portion. While judged frequency of
occurrence accounts for a relatively modest proportion of the criterion
variance, its relative independence from desirability causes this
contribution to represent largely unique variance.

Table 2 provides an indication of the degree to which combinations
of judgmental sets contribute to an understanding of the first
MMPI factor. It will be observed that scale values based on
judgments of frequency of occurrence and of endorsement in
combination predict first factor loadings substantially, at about the
same level as do desirability scale values. Thus it is possible to
identify two related judged dimensions of connotative frequency
of behavior which jointly predict the first MMPI factor as well
as do desirability scale values. It should be noted, however, that
this is not so much because judgments of frequency cover totally
different ground, but because they overlap substantially with
desirability judgments. This is particularly true of the judgments of
frequency of endorsement. Judgments of frequency of occurrence,
on the other hand, correlating less with desirability, do cover
somewhat different ground. This can be seen in the comparison of the
multiple correlations. Note that the increment in the multiple
correlation is largest when frequency of occurrence scale values
are added to desirability. Indeed, adding frequency of endorsement
values and actual endorsement frequency values to these two
predictors has virtually no effect on the multiple R. Some important
implications are suggested by these data. First, it is clear that it is
possible to predict to a substantial degree the first factor loadings
of the MMPI without recourse to a desirability conception;
secondly, that judgments of frequency of occurrence and of desirability
contribute uniquely to the prediction of first factor loadings; and
thirdly, that the portion of the MMPI first factor loading
predicted by judgments of frequency of endorsement and by actual
mean endorsement proportions is predicted equally well by
desirability judgments.

TABLE 2

Multiple Correlations between Connotative Properties of 40 MMPI Scales
and Loadings on the First MMPI Factor

                                   Multiple Correlation
First MMPI Factor    Fo+Fe       Fe+Dy    Fo+Dy    Fo+Fe+Dy    Fo+Fe+Dy+p

Hospital Sample      .67 (.92)   .77               .81 (.95)   .81 (.95)
College Sample       .67 (.91)   .76      .81      .81 (.95)   .82 (.95)

Note. Abbreviations are as follows:
Fo: Average judged frequency of occurrence of the trait represented by the item in each scale.
Fe: Average judged frequency of endorsement.
Dy: Average judged desirability scale value.
p: Actual endorsement frequency.
The direction of scoring has been standardized as described in the text. Values for the usual scoring method,
where available, are in parentheses.
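The comparison of multiple correlations described above can be illustrated with a minimal sketch. The scale values below are simulated, not the MMPI data; the function name `multiple_r` and the variables `dy`, `fe`, `fo`, and `loading` are hypothetical stand-ins for desirability, judged frequency of endorsement, judged frequency of occurrence, and first factor loadings. The sketch assumes ordinary least squares with an intercept, which may differ in detail from the computations actually used.

```python
import numpy as np

def multiple_r(X, y):
    """Multiple correlation R between criterion y and the columns of X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Fit ordinary least squares with an intercept column.
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ coef
    # R equals the simple correlation between fitted and observed values.
    return float(np.corrcoef(fitted, y)[0, 1])

# Simulated scale values for 40 scales (illustrative only).
rng = np.random.default_rng(0)
n_scales = 40
dy = rng.normal(size=n_scales)                      # desirability
fe = 0.8 * dy + 0.6 * rng.normal(size=n_scales)     # overlaps with desirability
fo = rng.normal(size=n_scales)                      # relatively independent
loading = 0.7 * dy + 0.4 * fo + 0.3 * rng.normal(size=n_scales)

r_dy = multiple_r(dy[:, None], loading)
r_fo_dy = multiple_r(np.column_stack([fo, dy]), loading)
r_all = multiple_r(np.column_stack([fo, fe, dy]), loading)
print(f"Dy: {r_dy:.2f}  Fo+Dy: {r_fo_dy:.2f}  Fo+Fe+Dy: {r_all:.2f}")
```

Because the predictor sets are nested, R cannot decrease as predictors are added; the size of each increment is what indicates whether a predictor covers ground not already covered by the others.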

What more general implications can be educed from these re- 
sults about the nature and dynamics of responding to a personal- 
ity inventory? The findings that two connotative conceptions of 
frequency predict first factor loadings as well as do desirability 
scale values should contribute to further attempts to understand the 
psychological bases for responding stylistically to a set of personal- 
ity items. It is appropriate at the present time to go beyond purely 
statistical studies of response consistency of but one connotative 
dimension of response, and inquire into the processes involved. 

It has been demonstrated that judges on the average can approximate
very closely the actual average level of endorsement when
instructed to do so. Judges can also depart appreciably from these to
make a reliable distinction between endorsements and other kinds
of behavior. The demonstration of this fact implies that it is
reasonable to challenge the notion that subjects must necessarily
be basing their responses on some implicit notion of an item's
desirability scale value. They could be using equally well some
combination of conceptions of frequency and desirability, or even
a more complicated multidimensional array of connotative attributes
in combination with veridical self-report. Even within the
desirability conception, further work is needed to explicate
alternative meanings, like, for example, “harmfulness to others,” as
distinguished from “pathological somatic processes” (cf. Jackson and
Singer, 1967; Messick and Jackson, 1966).

The fact that judgments of endorsement frequency appear to be
largely co-linear with judgments of desirability emphasizes the
importance of attempts to separate these processes both conceptually
and at the observational level. One cannot very well ascribe unique
processes to a desirability interpretation of responding if much
the same interpretation could have been made using a frequency
interpretation. One method for separating these competing
interpretations would be to develop response measures uniquely
representing these alternative stylistic conceptions. At least,
investigators should be aware of the differences between items and scales
related to desirability and frequency, where such differences
exist.


Item Selection Using Discrepancies in Judged Scale Values 


Item selection traditionally has employed at least an implicit
judgmental process for delimiting an item universe and an item
pool. Some item selection procedures, like the Thurstone attitude
scaling models, embody an explicit judgmental step. Certain authors,
like Loevinger (1957), have argued for the importance of human
judgment in emphasizing substantive considerations in defining
the item universe and in item selection. The systematic use of
desirability judgments for studying personality item properties has
received widespread acceptance during the past decade. It is
possible to extend this idea to a consideration of the joint use of more
than one dimension of connotative interpretation of item content.
For example, discrepancies between judged frequency of occurrence,
of endorsement, and desirability might provide a basis for selecting
items for heretofore unrecognized stylistic dimensions of response.
Such a procedure was apparently employed implicitly in the
development of the MMPI Lie Scale, where an attribute akin to
desirability was contrasted with judged frequency of the behavior
represented by the item. A more systematic appraisal of such an
approach would seem to be in order.

It would be interesting to review in an exploratory fashion
characteristics of items showing discrepancies in judged
desirability, judged frequency of endorsement, and judged frequency of
occurrence. It is, of course, recognized that a broader item pool
than that contained in the MMPI might be necessary to obtain
fully adequate sets of items showing differences between these
various properties. However, a systematic evaluation of items
selected for differences in judged attributes in the MMPI item
pool might bear upon knowledge of response processes elicited by
such items. Three tables were, therefore, prepared listing items
showing discrepancies between pairs of the three judged attributes
of item content. Table 3 presents a listing of items showing
differences between the judged frequency of endorsement and the judged
frequency of occurrence of the item, together with an indication of
the quantitative degree of difference. Table 4 presents items
differing in judged frequency of endorsement and judged desirability,
while Table 5 presents items differing in judged frequency of
occurrence and judged desirability. In each table, the difference
referred to, and the basis on which the items were selected and
ranked, was the relative arithmetic difference between the
appropriate pairs of scale values. It is interesting to note that many of
the items showing an extreme discrepancy between judged
frequency of endorsement and judged desirability are keyed on the
MMPI Lie Scale. But a careful perusal of these lists reveals that
there are many more items showing less extreme discrepancy values,
not keyed on the MMPI Lie Scale, which might reflect more subtle
types of test-taking defensiveness, and which might be used as an
experimental scale for measuring defensive tendencies. For example,
the item “I never worry about my looks,” or “At periods my mind
seems to work more slowly than usual,” might reveal a type of
self-deception or impression management which is not quite as blatant


TABLE 3

Items Showing Discrepant Judged Frequency of Endorsement
and Frequency of Occurrence Scale Values

Higher Frequency of Occurrence than Frequency of Endorsement Scale Value

Item

I do not always tell the truth.
Once in a while I think of things too bad to talk about.
During one period when I was a youngster I engaged in petty thievery.
I gossip a little at times.
I am worried about sex matters.
Once in a while I feel hate toward members of my family whom I usually love.
When I take a new job, I like to be tipped off on who should be gotten next to.
Some of my family have habits that bother and annoy me very much.
During one period when I was a youngster I engaged in petty thievery.
Once in a while I laugh at a dirty joke.
I do not like everyone I know.
I have often wished I were a girl. (Or if you are a girl) I have never been sorry that I am a girl.
Sometimes at elections I vote for men about whom I know very little.
I am very strongly attracted by members of my own sex.
There are certain people whom I dislike so much that I am inwardly pleased when they are catching it for something they have done.
I hear strange things when I am alone.
There is something wrong with my mind.
I would rather win than lose in a game.
I go to church almost every week.
My eyesight is as good as it has been for years.
I have never been paralyzed or had any unusual weakness of any of my muscles.
I daydream very little.
I have never had any breaking out on my skin that has worried me.
My face has never been paralyzed.
I read in the Bible several times a week.
I have very few quarrels with members of my family.
What others think of me does not bother me.
I do not have a great fear of snakes.
I am not easily angered.
It does not bother me that I am not better looking.
I have never had a fainting spell.
I am usually calm and not easily upset.
I have no fear of spiders.
Most nights I go to sleep without thoughts or ideas bothering me.
I practically never blush.


TABLE 4

Items Showing Discrepant Judged Frequency of Endorsement
and Desirability Scale Values

Higher Desirability than Frequency of Endorsement Scale Value

Item No.   Item

476        I am a special agent of God.
  8        My daily life is full of things that keep me interested.
  3        I wake up fresh and rested most mornings.
 74        I have often wished I were a girl. (Or if you are a girl) I have never been sorry that I am a girl.
 37        I have never been in trouble because of my sex behavior.
  4        I think I would like the work of a librarian.
 20        My sex life is satisfactory.
 66        I see things or animals or people around me that others do not see.
168        There is something wrong with my mind.
 50        My soul sometimes leaves my body.
318        My daily life is full of things that keep me interested.
433        I used to have imaginary companions.
107        I am happy most of the time.
420        I have had some very unusual religious experiences.
 35        If people had not had it in for me I would have been much more successful.
           Once in a while I put off until tomorrow what I ought to do today.
           I am sure I get a raw deal from life.
           My face has never been paralyzed.
           I work under a great deal of tension.
           Most people will use somewhat unfair means to gain profit or an advantage rather than to lose it.
           I have certainly had more than my share of things to worry about.
           No one seems to understand me.
           I find it hard to keep my mind on a task or job.
           Sometimes when I am not feeling well I am cross.
           I think nearly anyone would tell a lie to keep out of trouble.
           When someone does me a wrong I feel I should pay him back if I can, just for the principle of the thing.
           I work under a great deal of tension.
           I am likely not to speak to people until they speak to me.
           I have sometimes felt that difficulties were piling up so high that I could not overcome them.
           I have no trouble swallowing.

Difference

2.94, 2.79, 2.67, -2.06, -2.58, -2.28, -2.23, -2.22, -2.07, -1.96,
-1.94, -1.94, -1.87, -1.75, -1.74, -1.74, -1.66, -1.00, -1.55


as that encountered in the Lie Scale. A review of many of the other
items differentiated on the basis of discrepancies in connotative
attributes suggests that there are probably a number of other types
of stylistic personality scales which might be generated using
judgmental methods. For example, items which are neutral in
desirability, but which show a discrepancy between frequency of
occurrence and frequency of endorsement scale values might reflect a
pure form of deviant responding or hypercommunality, i.e., being
deviantly nondeviant (Sechrest and Jackson, 1963), quite distinct
from desirability. Thus, the use of patterns of scale values based
upon judged connotative attributes of personality items may well
prove to be a powerful new method for selecting items to measure
stylistic response tendencies.

TABLE 5

Items Showing Discrepant Judged Frequency of Occurrence
and Desirability Scale Values

Higher Desirability than Judged Frequency of Occurrence Scale Value

Item No.   Item

490        I read in the Bible several times a week.
476        I am a special agent of God.
214        I have never had any breaking out on my skin that has worried me.
 96        I have very few quarrels with members of my family.
  3        I wake up fresh and rested most mornings.
  8        My daily life is full of things that keep me interested.
262        It does not bother me that I am not better looking.
 37        I have never been in trouble because of my sex behavior.
 87        I would like to be a florist.
 95        I go to church almost every week.
 50        My soul sometimes leaves my body.
160        I have never felt better in my life than I do now.
107        I am happy most of the time.
152        Most nights I go to sleep without thoughts or ideas bothering me.
           I am not easily angered.
           I do not have a great fear of snakes.
           I never worry about my looks.
           I would like to be a journalist.
           I like to attend lectures on serious subjects.
           My eyesight is as good as it has been for years.
           Once in a while I put off until tomorrow what I ought to do today.
           I would rather win than lose in a game.
           I do not always tell the truth.
           I do not read every editorial in the newspaper every day.
           I do not like everyone I know.
           When someone does me a wrong I feel I should pay him back if I can, just for the principle of the thing.
           My table manners are not quite as good at home as when I am out in company.
           Sometimes at elections I vote for men about whom I know very little.
           There are certain people whom I dislike so much that I am inwardly pleased when they are catching it for something they have done.
           At times I feel like swearing.
           Sometimes when I am not feeling well I am cross.
           I get angry sometimes.
           I think most people would lie to get ahead.
           I frequently find myself worrying about something.
           Once in a while I think of things too bad to talk about.
           I gossip a little at times.
           When I am cornered I tell that portion of the truth which is not likely to hurt me.
           No one seems to understand me.
           I have certainly had more than my share of things to worry about.
316        I think nearly anyone would tell a lie to keep out of trouble.
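The selection procedure described above, ranking items by the arithmetic difference between a pair of judged scale values, can be sketched as follows. This is an illustrative sketch only: the `Item` class, the function `rank_by_discrepancy`, the placeholder item labels, and all scale values are hypothetical, and the sketch assumes the two attributes have already been expressed on a comparable scale.

```python
from dataclasses import dataclass

@dataclass
class Item:
    text: str
    desirability: float   # judged desirability scale value
    endorsement: float    # judged frequency of endorsement scale value
    occurrence: float     # judged frequency of occurrence scale value

def rank_by_discrepancy(items, first, second, top=3):
    """Return the `top` items with the largest signed difference first - second."""
    scored = [(getattr(it, first) - getattr(it, second), it.text) for it in items]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top]

# Invented scale values on a common scale, for illustration only.
items = [
    Item("item_a", desirability=7.0, endorsement=4.0, occurrence=5.0),
    Item("item_b", desirability=3.0, endorsement=5.5, occurrence=6.0),
    Item("item_c", desirability=5.0, endorsement=5.0, occurrence=2.0),
]

higher_desirability = rank_by_discrepancy(items, "desirability", "endorsement", top=2)
print(higher_desirability)  # [(3.0, 'item_a'), (0.0, 'item_c')]
```

Reversing the order of the two attribute names yields the opposite list, so that each table of discrepant items corresponds to two calls on the same pair of attributes.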


Individual Judgments of Desirability, Frequency of Occurrence, 
and Frequency of Endorsement 


Attention in the present work has been up to this point focused
primarily on judgmental consistencies as these serve to account for
first factor loadings, and as they might serve as an aid to item
selection. There is yet a third vantage point from which to view
item judgments, namely, in terms of the individual judge. If a
judge estimates some property of 566 MMPI items, one can consider
the subject the unit of analysis and correlate two or more sets of
judgmental instructions over the 566 MMPI items. Taylor (1959)
and Messick (1964) have each previously employed such an
analysis. Messick computed 145 individual correlations between a
subject’s item responses and his own judgments of item desirability.
These coefficients ranged from +.87 to -.58, with a median of .58.
Messick interpreted these correlations as indices of subjects’
tendencies to describe themselves in personally desirable terms, and
found that they did in fact correlate significantly with a
desirability scale, with a scale reflecting conventionality, with a scale of
test-taking defensiveness, as well as others. Messick’s hypothesis
that such intra-individual response consistencies may reflect
defensive postures, which might be useful in interpreting judgmental
and response measures of personality, warrants further appraisal.

Several correlations for each subject over the entire set of 566
MMPI items were therefore computed. Figure 2 provides a graphic
representation of the range and distributions of these individual
correlations between a single subject’s judgment of a connotative
attribute such as desirability or frequency of occurrence, and the
average scale values obtained for the items for all judges. Thus
Figure 2 is divided into four sections, each one referring to a
frequency distribution of correlations dealing with single individuals
judging MMPI items under a specific judgmental set. In the upper
left-hand portion of the figure a frequency distribution is presented
of correlations representing the degree to which an individual’s
judgments of frequency of endorsement, frequency of occurrence, and
desirability correspond to the average judged frequency of
endorsement scale values. Similarly, in the upper right-hand portion of
Figure 2, a frequency distribution is presented of correlations between
each individual's judgments of frequency of endorsement, frequency of
occurrence, and desirability and average judged frequency of
occurrence scale values for the 566 MMPI items. In the lower
left-hand portion of Figure 2 a frequency distribution is presented
which depicts the degree to which different subjects approximated
actual endorsement frequency of MMPI items with their
judgments of frequency of endorsement and of occurrence and
desirability. Finally, in the lower right-hand portion the degree to which
desirability scale values are approximated by the separate judgments
for different subjects is presented by the frequency distribution of
the individual correlations with desirability scale value. It is to
be noted here that the correlations are based on an N of 566, and
that a separate correlation is obtained for each individual under
each of the judgmental sets. For example, in the case of desirability,
a particular subject might have judged the desirability of MMPI
items in such a way as to yield a .62 correlation between his
individual judgments on a 9-point scale and the scale values as
derived from the ratings of the entire group. These correlations were
grouped within frequency categories, and it is these proportions
that are presented in Figure 2.
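The individual-level analysis just described, one correlation per judge over the full item set followed by grouping into frequency categories, may be sketched as follows. The ratings are simulated and the function names are hypothetical. As a simplifying assumption, the group scale value of an item is taken here to be the mean rating across judges, which need not match the scaling procedure actually used to derive the scale values.

```python
import numpy as np

def judge_consensus_correlations(ratings):
    """Correlate each judge's ratings (rows) with the mean rating per item."""
    ratings = np.asarray(ratings, dtype=float)
    scale_values = ratings.mean(axis=0)  # assumed stand-in for group scale values
    return np.array([np.corrcoef(row, scale_values)[0, 1] for row in ratings])

def proportion_histogram(correlations, n_bins=20):
    """Group correlations into equal-width categories on [-1, 1] as proportions."""
    edges = np.linspace(-1.0, 1.0, n_bins + 1)
    counts, _ = np.histogram(correlations, bins=edges)
    return edges, counts / counts.sum()

# Simulated 9-point ratings: 20 judges, 566 items, judges differing in noise.
rng = np.random.default_rng(1)
n_judges, n_items = 20, 566
latent = rng.uniform(1, 9, size=n_items)
noise_sd = np.linspace(0.5, 4.0, n_judges)
noise = rng.normal(size=(n_judges, n_items)) * noise_sd[:, None]
ratings = np.clip(np.rint(latent + noise), 1, 9)

r = judge_consensus_correlations(ratings)
edges, proportions = proportion_histogram(r)
print(f"median r = {np.median(r):.2f}, range = {r.min():.2f} to {r.max():.2f}")
```

The median and range printed at the end correspond to the summary statistics reported for each judgmental set; the proportions correspond to the heights of one panel of a figure such as Figure 2.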

A review of these data reveals several interesting trends. As to be
expected, on the average individuals were able to most closely
approximate the scale values which were consistent with the
particular judgmental set. Thus, while judging frequency of
endorsement, they most closely approximated the frequency of
endorsement scale values, and while judging frequency of occurrence
they most closely approximated, on the average, frequency of
occurrence scale values. The same thing held for desirability. With
regard to approximating actual endorsement frequency, it is to be
noted that judgments of frequency of endorsement more closely
resembled actual endorsement frequency on the average than did
judgments of frequency of occurrence. Here, as in other contexts,
desirability judgments have been elicited on a basis more consistent
with judgments of frequency of endorsement than of frequency of
occurrence, and, in general, judges have validly distinguished
between these two connotative attributes.

Perhaps most striking in these data are the wide ranges in the
degree to which judges are able to approximate the average
judgment. For example, some judges employ idiosyncratic notions of
frequency of endorsement, while others remained quite close to
the consensus (Median, .64; range, .76 to .01). Some made their
judgments of frequency of endorsement primarily in terms of an
item’s desirability scale value, while a few others made a sharp
distinction between the two (Median, .57; range, .78 to -.08).
Some were quite accurate in approaching the actual frequency of
endorsement in their judgments, while a few others departed
markedly from the true values (Median, .55; range, .75 to .00).
Generally, judges who were accurate in approximating the
consensus in their judgments of connotative attributes of items showed
a substantial correlation with desirability scale values across all
three judgmental sets. Those who were inaccurate tended to be
judging all items more in terms of frequency of occurrence, perhaps
because they were incapable of distinguishing occurrence from
desirability regardless of the instructions. Some subjects judged
items as if they could see little else than desirability. Others, even
when they are instructed to rate desirability, show minimal
consensus in their judgments.

These results, although quite complex when viewed individually,
are important if one’s aim is to use judgments to reveal something
of the nature of the judge. The judgmental biases which are brought
to bear are worthy of study in themselves, as are idiosyncratic
interpretations of judgmental sets. Departures from the consensus
judgment, particularly when they can be related to a particular
class of item content or to criterion measures, may be diagnostic
of personal values or other motivational or cognitive structures. An
important next step in this area will be research studies oriented
around the question of the conditions under which such judgments
can, and those under which they cannot, validly reveal something
of importance about the judges.


Conclusions 


1. Personality questionnaire items may be fruitfully considered in
terms of alternative connotative judged properties, including, but
not limited to, desirability and frequency interpretations.

2. Judges can reliably distinguish between frequency of
endorsement and frequency of occurrence.

3. Actual endorsement frequency can be approximated by groups
judging frequency of endorsement with a very high degree of
accuracy.

4. It is possible to account for MMPI first factor loadings in terms
of judged frequency of endorsement and of occurrence. In
combination, these two frequency conceptions account for variance
associated with MMPI first factor loadings as well as desirability
scale values. These three judged attributes in combination were
quite highly predictive of MMPI first factor loadings.

5. Application of Tellegen’s recommendation generally attenuates
correlations between connotative indices and factor loadings. It
leads to standardization only when scale properties, like
desirability, are considered singly, and not usually when more than one
property is considered.

6. Item selection based on discrepancies between scale values of
desirability, judged frequency of endorsement and of occurrence
uncovered sets of items with interesting and potentially useful
properties for measuring stylistic attributes of personality.

7. At the individual level, judges showed very wide differences in
their ability to approximate the consensus judgment of
desirability, judged frequency of endorsement, and judged frequency
of occurrence, some being very sensitive to instructional sets, and
some consistently approximating one or the other regardless of
instructions. Accumulating evidence suggests that such strategies
of judging are potentially useful in obtaining information about
the judge.

8. Investigations of connotative attributes of item content should
be broadened in at least two directions: to include connotative
and stylistic attributes other than desirability and frequency,
and to emphasize the implicit psychological processes of both the
judge and the respondent.


REFERENCES


Diederich, G. W., Messick, S., and Tucker, L. R. A General Least
Squares Solution for Successive Intervals. Psychometrika, 1957,
22, 159-173.

Edwards, A. L. The Social Desirability Variable in Personality
Assessment and Research. New York: Dryden, 1957.

Jackson, D. N. and Messick, S. Acquiescence and Desirability as
Response Determinants on the MMPI. EDUCATIONAL AND
PSYCHOLOGICAL MEASUREMENT, 1961, 21, 771-792.

Jackson, D. N. and Messick, S. Response Styles on the MMPI:
Comparison of Clinical and Normal Samples. Journal of
Abnormal and Social Psychology, 1962, 65, 285-299.

Jackson, D. N. and Messick, S. Response Styles and the Assessment
of Psychopathology. In D. N. Jackson and S. Messick (Eds.),
Problems in Human Assessment. New York: McGraw-Hill Book
Co., 1967. Pp. 541-558.

Jackson, D. N. and Singer, J. E. Judgments, Items, and Personality.
Journal of Experimental Research in Personality, 1967, 2, 70-79.

Loevinger, Jane. Objective Tests as Instruments of Psychological
Theory. Psychological Reports, 1957, 3, 635-694.

Messick, S. Desirability Judgments and Inventory Responses in the
Assessment of Personality. Research Memorandum 64-18.
Princeton, New Jersey: Educational Testing Service, 1964.

Messick, S. and Jackson, D. N. Desirability Scale Values and
Dispersions for MMPI Items. Psychological Reports, 1961, 8,
409-414.

Messick, S. and Jackson, D. N. Judgmental Dimensions of
Psychopathology. Paper presented at the meeting of the American
Psychological Association, New York, 1966.

Sechrest, L. B. and Jackson, D. N. Deviant Response Tendencies:
Their Measurement and Interpretation. EDUCATIONAL AND
PSYCHOLOGICAL MEASUREMENT, 1963, 23, 33-53.

Taylor, J. B. Social Desirability and MMPI Performance: The
Individual Case. Journal of Consulting Psychology, 1959, 23,
514-517.

Tellegen, A. Direction of Measurement: A Source of
Misinterpretation. Psychological Bulletin, 1965, 63, 233-243.

Wiggins, J. S. Strategic, Method, and Stylistic Variance in the
MMPI. Psychological Bulletin, 1962, 59, 224-242.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT
1969, 29, 295-301. 


THE EFFECT OF ITEM STRATIFICATION ON 
ERRORS OF MEASUREMENT 


H. G. OSBURN 
University of Houston 


This paper shows that, in the case of matched item tests, the
reduction in errors of measurement for tests constructed by
stratified sampling as compared with tests constructed by random
sampling from an infinite population of items is a simple function of
the variance of the difference between pairs of strata true scores.

For unmatched item tests the reduction in errors of measurement
due to stratification is a function of the variance (across strata)
of the strata mean true scores plus the variance of the difference
between pairs of strata true scores.

These results predict that, in the case of matched item tests, the
largest reductions in errors of measurement will result from
stratification on item content rather than item difficulty, while for
unmatched item tests just the opposite is true.


Reduction in Errors of Measurement for a Single Examinee


Consider an infinite population of items that is divided into C
strata where the proportion of items in each stratum is a constant.
Let $t_{pg}$ be the proportion correct score for individual p over all
items in stratum g. Then the proportion correct score for individual
p over the entire population of items is

$$t_p = \sum_g t_{pg}/C. \tag{1}$$

It is well known that when K items are randomly sampled from
an item population (without regard to stratification), the sampling
variance of $z_p$, the observed proportion correct score for individual
p, over all possible K-item samples is

$$V(z_p) = (t_p - t_p^2)/K. \tag{2}$$

Substituting equation (1) into (2) we obtain

$$V(z_p) = \Big[\sum_g t_{pg}/C - \Big(\sum_g t_{pg}/C\Big)^2\Big]\Big/K. \tag{3}$$

Now consider the sampling variance of $z_p$ over all possible
K-item samples when K/C items are randomly selected from each
stratum. By the formula for the variance of a sum we obtain

$$V(z_p)_s = \sum_g \big[(t_{pg} - t_{pg}^2)/(K/C)\big]\big/C^2. \tag{4}$$

We wish to investigate the reduction in errors of measurement
with stratified sampling as compared to random sampling. The
difference between $V(z_p)$ and $V(z_p)_s$ gives a simple result.

$$V(z_p) - V(z_p)_s = \Big[C\sum_g t_{pg} - \Big(\sum_g t_{pg}\Big)^2 - C\sum_g t_{pg} + C\sum_g t_{pg}^2\Big]\Big/(C^2K), \tag{5}$$

which reduces to

$$V(z_p) - V(z_p)_s = \Big(C\sum_g t_{pg}^2 - \sum_g\sum_h t_{pg}t_{ph}\Big)\Big/(C^2K), \tag{6}$$

which can be rewritten as

$$V(z_p) - V(z_p)_s = \sum_{g<h}(t_{pg} - t_{ph})^2\big/(C^2K). \tag{7}$$

It can be shown that $\sum_{g<h}(t_{pg} - t_{ph})^2/C^2$ is equal to $\sigma_{t_p}^2$, where
$\sigma_{t_p}^2$ is the variance of $t_{pg}$ across strata. Therefore, we have the simple
relation

$$V(z_p) - V(z_p)_s = \sigma_{t_p}^2/K. \tag{8}$$

Equation (8) shows that the extent to which the sampling error
of $z_p$ is reduced by stratified sampling is simply a function of the
variance across strata of the true proportion correct scores for
individual p.
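The relation between equations (3), (4), and (8) can be checked with exact arithmetic. The sketch below is illustrative only: the function names and the three stratum true scores are hypothetical, and the formulas are coded as read from equations (3) and (4), with K assumed to be a multiple of C.

```python
from fractions import Fraction

def var_random(t, K):
    """Equation (3): sampling variance of z_p when K items are drawn at random."""
    C = len(t)
    t_p = sum(t) / C
    return (t_p - t_p**2) / K

def var_stratified(t, K):
    """Equation (4): sampling variance when K/C items are drawn from each stratum."""
    C = len(t)
    per_stratum = Fraction(K, C)
    return sum((t_g - t_g**2) / per_stratum for t_g in t) / C**2

def var_across_strata(t):
    """Variance of the stratum true scores t_pg across the C strata."""
    C = len(t)
    mean = sum(t) / C
    return sum((t_g - mean)**2 for t_g in t) / C

# Hypothetical true scores for one examinee in C = 3 strata, with K = 30 items.
t = [Fraction(9, 10), Fraction(1, 2), Fraction(1, 5)]
K = 30

reduction = var_random(t, K) - var_stratified(t, K)
print(reduction)                                # 37/13500
print(reduction == var_across_strata(t) / K)    # True
```

For this examinee the reduction equals the across-strata variance of the true scores (37/450) divided by K, as equation (8) states; an examinee with equal true scores in every stratum would show no reduction at all.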


Average Reduction of Errors of Measurement 


The average effect of stratification can be seen by computing
$\bar V(z_p) - \bar V(z_p)_s$ over an entire population of examinees. Since
$\overline{V(z_p) - V(z_p)_s} = \bar V(z_p) - \bar V(z_p)_s$, using equation (7) we can write

$$\bar V(z_p) - \bar V(z_p)_s = \sum_p \Big[\sum_{g<h}(t_{pg} - t_{ph})^2/(C^2K)\Big]\Big/N, \tag{9}$$

which can be rewritten as

$$\bar V(z_p) - \bar V(z_p)_s = \sum_{g<h}\Big[\sum_p t_{pg}^2/N - 2\sum_p t_{pg}t_{ph}/N + \sum_p t_{ph}^2/N\Big]\Big/(C^2K). \tag{10}$$

Since $\sum_p t_{pg}^2/N = \sigma_{t_g}^2 + \bar t_g^2$ and $\sum_p t_{pg}t_{ph}/N = \rho_{gh}\sigma_{t_g}\sigma_{t_h} + \bar t_g\bar t_h$,
equation (10) reduces to

$$\bar V(z_p) - \bar V(z_p)_s = \sum_{g<h}\big[\sigma_{t_g}^2 + \sigma_{t_h}^2 - 2\rho_{gh}\sigma_{t_g}\sigma_{t_h} + (\bar t_g - \bar t_h)^2\big]\big/(C^2K), \tag{11}$$

and since $\sum_{g<h}(\bar t_g - \bar t_h)^2/C^2 = \sigma_{\bar t}^2$, where $\sigma_{\bar t}^2$ is the variance across
strata of the stratum mean true scores, we obtain

$$\bar V(z_p) - \bar V(z_p)_s = \sum_{g<h}\sigma_{(t_g - t_h)}^2\big/(C^2K) + \sigma_{\bar t}^2/K. \tag{12}$$

Equation (12) shows that the average reduction in errors of
measurement due to stratification is a function of the variance of
the strata mean true scores and the variance of the difference
between pairs of strata true scores. Given equal strata mean true
scores and uniform interitem covariances, there will be no
reduction in errors of measurement since under these conditions


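$$\rho_{gh} = \frac{E_{ij}(\sigma_{x_{gi}x_{hj}})}{\big[E_{ij}(\sigma_{x_{gi}x_{gj}})\,E_{ij}(\sigma_{x_{hi}x_{hj}})\big]^{1/2}} = 1.0$$

and $\sigma_{(t_g - t_h)}^2$ will be zero for every pair of strata, and $\sigma_{\bar t}^2$ will be zero.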
Equation (11) can also be simplified by noting that

$$\sum_{g<h}\big[\sigma_{t_g}^2 + \sigma_{t_h}^2 - 2\rho_{gh}\sigma_{t_g}\sigma_{t_h}\big] = \tfrac{1}{2}\sum_g\sum_h\big[\sigma_{t_g}^2 + \sigma_{t_h}^2 - 2\rho_{gh}\sigma_{t_g}\sigma_{t_h}\big]$$

$$= \tfrac{1}{2}\Big[2C\sum_g\sigma_{t_g}^2 - 2\sum_g\sum_h\rho_{gh}\sigma_{t_g}\sigma_{t_h}\Big]$$

$$= C\sum_g\sigma_{t_g}^2 - C^2\sigma_t^2,$$

where $\sigma_t^2$ is the variance over persons of $t_p$. Substituting this result
into equation (11) gives

$$\bar V(z_p) - \bar V(z_p)_s = \Big(\sum_g\sigma_{t_g}^2/C - \sigma_t^2 + \sigma_{\bar t}^2\Big)\Big/K. \tag{13}$$

1 This derivation was kindly pointed out by a reviewer of the manuscript.

Equation (13) shows that the average reduction in errors of
measurement due to stratification is a function of the discrepancy
between the average variance of the strata true scores and the
variance of the average strata true scores. This discrepancy will be
zero only in the case where the strata true scores are perfectly
intercorrelated.
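The average reduction can be verified numerically by comparing the per-person reductions of equation (8), averaged over examinees, with the right-hand side of equation (13). The sketch below is illustrative only. It reads equation (13) as the average within-stratum variance of true scores, minus the variance over persons of $t_p$, plus the variance of the stratum means, all divided by K; the function name and the simulated true scores are hypothetical, and population variances (divisor N or C) are assumed throughout.

```python
import numpy as np

def average_reduction(T, K):
    """Return (direct, formula) for a persons x strata matrix of true scores.

    direct  : mean over persons of the across-strata variance, divided by K
              (equation (8) averaged over examinees)
    formula : right-hand side of equation (13) as read here
    """
    T = np.asarray(T, dtype=float)
    direct = T.var(axis=1).mean() / K
    avg_var_within_strata = T.var(axis=0).mean()   # sum_g sigma_tg^2 / C
    var_person_scores = T.mean(axis=1).var()       # sigma_t^2
    var_stratum_means = T.mean(axis=0).var()       # variance of stratum means
    formula = (avg_var_within_strata - var_person_scores + var_stratum_means) / K
    return direct, formula

# Simulated true scores: 200 examinees, 4 strata differing in mean level.
rng = np.random.default_rng(2)
ability = rng.uniform(0.3, 0.7, size=(200, 1))
stratum_shift = np.array([-0.2, 0.0, 0.1, 0.2])
T = np.clip(ability + stratum_shift + rng.normal(scale=0.05, size=(200, 4)), 0.0, 1.0)

direct, formula = average_reduction(T, K=40)
print(np.isclose(direct, formula))  # True

# Strata with identical true scores for every examinee: no reduction.
T_parallel = np.repeat(ability, 4, axis=1)
direct_parallel, formula_parallel = average_reduction(T_parallel, K=40)
```

The two quantities agree to rounding error for any matrix of true scores, since the equality is an algebraic identity rather than a property of the simulated data.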


Relation to Kuder-Richardson Formula 21 (y) 


Lord (1955a, p. 329 and p. 334) has shown that $\bar V(z_p)$ is identically
the same as the squared standard error of measurement computed
from the KR-21 reliability coefficient. Lord's results also apply to the
$y$ coefficient defined by Cronbach, Rajaratnam and Gleser (1963)
since $y$ is an unbiased version of KR-21. It also can be shown that
$\bar V(z_p)_s$ is identically the same as $E[S^2(1 - y_s)]$ where $y_s$ is a
stratified version of KR-21 appropriate for tests constructed by stratified
item sampling (Rajaratnam, Cronbach and Gleser, 1965). Thus, the
difference between the squared standard error of measurement
computed by $y$ (KR-21) and the squared standard error of measurement
computed by $y_s$ (a stratified version of KR-21) can be predicted by
equation (13).


Relation to Kuder-Richardson Formula 20 (a) 
i Consider the deviation score (z, — 2) where z is the mean propor- 
tion correct test score for a sample of examinees. By the formula 


for the variance of a difference the sampling variance of (z, — 4) 
over all samples of K items is 


V@, — 2) = Vle) + V(2) — 2 COV (e, , 3). (14) 


If V(z, — 2) is averaged over all individuals in the population, We 
obtain 


TG, — 3) = Ye) + VO — 2C0V 6, 5; a3 
but COV (,, 2) = V(2), therefore 
Te -3 = Te) -vO (16) 


Lord (1955b) has shown that $\bar{V}(z_p) - V(\bar{z})$ is identically
the same as the squared standard error of measurement computed by
KR-20. The same result of course applies to Cronbach's coefficient
$\alpha$ (Cronbach, Rajaratnam and Gleser, 1963).

It is noted parenthetically that equation (16) shows clearly that
KR-20 or $\alpha$ is not the reliability of the test scores, as is
commonly assumed, unless all forms of the test are exactly equivalent,
i.e., $V(\bar{z}) = 0$. If the test is constructed by random item
sampling, KR-20 is the reliability of the deviation score
$(z_p - \bar{z})$.

Tt can also be shown that V(z, — 2), is equal to E,[S,(1 — a)l 
where o, is a stratified version of æ (KR-20). Since Ple- a) = 
Üle). — V(2),, we obtain 

1-2) — Ve, —2. = Ve) — V). t VO — V. 0) 
Lord (1955b) has shown that (with random item sampling) V(@) = 


¢,;/K where o, is the variance of the true item difficulties. It can 
also be shown that (with stratified item sampling) V(2). = 22« ore / 


KC = o,°/K — o;,?/K. Thus, combining equations (13) and (17) 
we obtain 


Ye,—3 — Pe 8. = [Z orn /C — of V/K. (18) 


Equation (18) shows that the reduction in the sampling variance
of $(z_p - \bar{z})$ due to stratification depends solely upon the
difference between the average variance of strata true scores and the
variance of the average strata true scores. If the interitem
covariances across strata are equal to the interitem covariances
within strata, there will be no reduction in sampling error due to
stratification even though the strata mean scores may differ.

Since the squared standard error of measurement computed by $\alpha$
is equal to $\bar{V}(z_p - \bar{z})$ and the squared standard error of
measurement computed by $\alpha_s$ is equal to
$\bar{V}(z_p - \bar{z})_s$, the reduction in sampling error due to
stratification as computed by $\alpha$ and $\alpha_s$ can be predicted
by equation (18).


Empirical Studies of the Effect of Stratification 


Cronbach, Schönemann and McKie (1965) have published an
empirical study comparing estimates of Ey(pryey°) by o and o, when 
tests are constructed by stratified item sampling. While the present
study has no bearing on the question of how accurately $\alpha$ and $\alpha_s$
estimate E, (5... ?), equation (18) does serve to rationalize one of the 
general conclusions of this study. “Stratifying on content is clearly
more important than stratification on difficulty, both in test
construction and test analysis. The so-called difficulty factors that
have received so much attention from some test theorists prove to have
very little influence on $\alpha$ coefficients unless r, is
unrealistically high” (Cronbach, et al., 1965, p. 331).
Equation (18) shows that the discrepancy between the stratified
and unstratified error variances is a maximum when the strata
scores are completely independent. Equation (18) also shows that
the differences are magnified by increasing the within-strata
interitem covariances. Thus, the largest discrepancies occur with
short tests that are homogeneous within strata and heterogeneous
across strata.
Equation (18) also shows why stratification on difficulty has a
small but detectable effect as was found by Cronbach, et al. (1965).
Stratification on difficulty reduces p,, by virtue of the systematic 
discrepancies in item difficulties across strata which lower 
E i05 iniia). 
Shoemaker (1966), using a design very similar to that of Cronbach,
et al. (1965), studied the discrepancies between $y$ and $y_s$
when stratified sampling is used and each individual is given a
different test. In contrast to the conclusions about the effect of
stratification on $\alpha$ coefficients, Shoemaker concluded that the
largest discrepancy between $y$ and $y_s$ occurred when the items were
stratified on difficulty rather than content. Equation (13) shows
why this is true. The difference between the error variance
computed by $y$ and the error variance computed by $y_s$ is a function
not only of the between-strata correlations but also of the variance
of the strata means. Since stratification on difficulty produces
large differences in strata mean scores, this factor is controlling.
Shoemaker (1966) also found that stratification on content
produces about the same effect on $y$ errors as it does on $\alpha$
errors. Thus, the conclusion seems to be that for unmatched item
studies stratification on difficulty is more important than
stratification on content, while for matched item studies the opposite
is true.
ESTIMATING TEST RELIABILITY FROM THE
ITEM-TEST CORRELATIONS


RICHARD H. GAYLORD 


Hydrotronics Division of Data-Design Laboratories* 
Falls Church, Virginia 


Guilford (1950) gives a formula for estimating test reliability
from the number of items and the average item-test correlation:

$$r_{tt} = \frac{n\,\bar{r}_{it}^{2}}{1 + (n - 1)\,\bar{r}_{it}^{2}} \tag{1}$$

where $\bar{r}_{it}$ = the average item-test correlation
and $n$ = the number of items in the test.
Guilford obtained this relationship using the Spearman-Brown
formula and the relationship,

$$\bar{r}_{ij} = \bar{r}_{it}^{2}, \tag{2}$$

where $\bar{r}_{ij}$ = the average item intercorrelation
and $\bar{r}_{it}^{2}$ = the square of the average item-test correlation,
obtained from Richardson (1936).
In the course of examining some of the relationships of test
reliability, item intercorrelations, and item-test coefficients, it
became clear that use of relationship (1) leads to anomalous
results. The problem was traced to relationship (2), which is in
error. The correct statement should be

$$\bar{r}_{ij} = \frac{n\,\bar{r}_{it}^{2} - 1}{n - 1}. \tag{3}$$
Expression (3) can be obtained in the following manner. 
An item test correlation is written as a correlation of sums and 
*The opinions expressed are not to be construed as reflecting the
policy of Hydrotronics Division of Data-Design Laboratories.
simplified by assuming item standard deviations to be substantially
equal. The result is

$$r_{it} = \frac{1 + (n - 1)\,\bar{r}_{ij}'}{\sqrt{n + n(n - 1)\,\bar{r}_{ij}}} \tag{4}$$

where $\bar{r}_{ij}'$ is the average of the correlations of the ith item
with the remaining n − 1 items in the test,
and the remaining terms are as previously defined. Both sides of
(4) are then summed over all items and divided by n to yield

$$\bar{r}_{it} = \frac{n + n(n - 1)\,\bar{r}_{ij}}{n\sqrt{n + n(n - 1)\,\bar{r}_{ij}}}. \tag{5}$$

Simplifying and solving for $\bar{r}_{ij}$ yields expression (3), the
relation we sought.
Following Guilford, the Spearman-Brown formula is used to
estimate the reliability of the test from the average item
intercorrelation, and expression (3) is then used to replace the
average item intercorrelation with the average item-test correlation.
The result is

$$r_{tt} = \frac{n\,\bar{r}_{it}^{2} - 1}{(n - 1)\,\bar{r}_{it}^{2}}, \tag{6}$$

the correct expression for the reliability.
Comparing expressions (3) and (6) shows that

$$\bar{r}_{ij} = r_{tt}\,\bar{r}_{it}^{2}, \tag{7}$$

an unexpected relationship.
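
The chain from the average item intercorrelation to expressions (3), (6), and (7) can be checked numerically. The following Python sketch is only an illustration: the function names and the trial values (n = 20, average item intercorrelation .25) are assumptions made here, and the check relies on the same equal-standard-deviation simplification used in the derivation.

```python
import math


def avg_item_test_correlation(n, r_ij):
    """Expression (5) after simplification: r_it = sqrt((1 + (n - 1) r_ij) / n)."""
    return math.sqrt((1 + (n - 1) * r_ij) / n)


def avg_item_intercorrelation(n, r_it):
    """Expression (3): average item intercorrelation from the item-test value."""
    return (n * r_it ** 2 - 1) / (n - 1)


def spearman_brown(n, r_ij):
    """Reliability of an n-item test from the average item intercorrelation."""
    return n * r_ij / (1 + (n - 1) * r_ij)


def reliability_from_item_test(n, r_it):
    """Expression (6): reliability from the average item-test correlation."""
    return (n * r_it ** 2 - 1) / ((n - 1) * r_it ** 2)


# Hypothetical trial values, chosen only for illustration.
n, r_ij = 20, 0.25
r_it = avg_item_test_correlation(n, r_ij)
r_tt = spearman_brown(n, r_ij)

print(round(avg_item_intercorrelation(n, r_it), 6))   # recovers r_ij
print(round(reliability_from_item_test(n, r_it), 6))  # agrees with Spearman-Brown
print(round(r_tt * r_it ** 2, 6))                     # expression (7): equals r_ij
```

Under these trial values relationship (2) would instead give an intercorrelation of about .29 (the squared item-test correlation), which illustrates the size of the discrepancy the note describes.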

The relationships (3), (6), and (7) are not presented as of
practical value in test analysis. The average item intercorrelation
has few uses and test reliability is more efficiently estimated with
the Kuder-Richardson equations. The relationships given should,
however, give increased understanding of the relationships between
item intercorrelations, item-test correlations, and test reliability.
ESTIMATING THE RELIABILITY
OF PEER RATINGS


LEONARD V. GORDON 
State University of New York at Albany 


Peer ratings or buddy ratings are widely used for measuring
certain characteristics of group members. This type of rating
has been found to yield significant predictions of various
performance criteria (see Hollander, 1957 and Smith, 1967) and is
relatively independent of group composition (Gordon and Medland,
1965).

The peer rating approach involves all members of a group
evaluating one another on some characteristic. Each person's score
on the characteristic is based on the assessment made by all of his
peers, and the reliability of the peer ratings is determined by the
extent to which all members of the group agree in their assessments
of one another. The magnitude of the reliability is a
function of such factors as visibility of the characteristic to group
members, amount of contact with one another and number of
individuals making the ratings.

Hollander (1957) indicated that the split-half method was
conventionally used for estimating the reliability of peer ratings,
and the writer's recent survey of the literature revealed that this is
still the case. The split-half procedure requires dividing the rating
sheets into two halves, generating scores for the two halves,
computing their intercorrelation, and applying the Spearman-Brown
prophecy formula. However, a well-known problem with
the split-half method is its lack of uniqueness, because different
coefficients can result from different splits. Coefficient alpha
(Cronbach, 1951), which is equivalent to the mean of the stepped-up
coefficients resulting from all such possible splits, and equivalent
formulae, resolve this deficiency, but to the writer's knowledge
have not been applied to peer rating data.
The estimation of the reliability of peer ratings essentially involves
the determination of the average intercorrelation among judges'
ratings and the correction of this value to account for the number
of judges. In peer ratings, each of n judges rates (n − 1) subjects,
where the subjects happen to be the judges themselves. The average
intercorrelation of ratings by pairs of judges ($\bar{r}$) is adapted
from Peters and Van Voorhis (1940, p. 197) as

$$\bar{r} = \frac{S_C^2 - nS_j^2}{(n - 1)\,nS_j^2}, \tag{1}$$

where n is the number of judges, $S_j^2$ is the variance of any judge
under the assumption that all judges have the same variance,
C is the combined (summed or averaged) rating obtained by
each subject and $S_C^2$ is the variance of these combined ratings.
The reliability of the pooled judgements of the n judges ($r_{nn}$)
is found by entering (1) into the Spearman-Brown prophecy
formula. Simplifying, we obtain

$$r_{nn} = \frac{n}{n - 1} \cdot \frac{S_C^2 - nS_j^2}{S_C^2}. \tag{2}$$
The judge's failure to rate himself does not affect the value of
$S_C^2$ since this omission is uniform for all judges. Since all
judges are required to distribute the same set of weights to the
remaining n − 1 subjects, $S_j^2$ will be equal for all judges.
Formula 2 is identical with coefficient alpha for this situation
and is applicable where any number of high and/or low ratings
are uniformly made by all judges and where any weighting
scheme is applied. Computational formulae may be readily derived
for various rating and scoring procedures. Let us consider the
three procedures that are the most commonly employed.2
In the first case an equal number of high and low ratings are 
made by each judge and the weighting scheme for the high and low 


1 Formula 2 is 
shown that where raters have 


veloped for use with averaged ratings or all posit} i ive iden 
Men B ES positive weights, would giv 
selections is symmetrical. Where W represents each weight assigned
at either end, but not both ends, $nS_j^2$ is found to equal
$2n\Sigma W^2/(n - 1)$ and $S_C^2$ becomes $\Sigma C^2/n$.
Substituting into (2) and simplifying, we obtain

Case I
$$r_{nn} = \frac{n}{n - 1}\left[1 - \left(\frac{n}{n - 1}\right)\frac{2n\,\Sigma W^2}{\Sigma C^2}\right]. \tag{3}$$

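In the second case, only high (or only low) ratings are obtained.
In this instance, $nS_j^2$ is found to equal
$[n(n - 1)\Sigma W^2 - n(\Sigma W)^2]/(n - 1)^2$ and $S_C^2$ becomes
$[\Sigma C^2 - n(\Sigma W)^2]/n$. Again, substituting
into (2) and simplifying, we obtain

Case II
$$r_{nn} = \frac{n}{n - 1}\left[1 - \left(\frac{n}{n - 1}\right)^{2}\frac{(n - 1)\Sigma W^2 - (\Sigma W)^2}{\Sigma C^2 - n(\Sigma W)^2}\right]. \tag{4}$$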


It may be noted that formula 4 degenerates to 3 when upper
and lower ratings are obtained and when symmetrical weights
are used, since $\Sigma W = 0$ and $\Sigma W^2$ is doubled.

The third case requires each judge to rank all subjects excluding
himself. Each subject's score (C) is the sum of the ranks assigned
to him. In this case $nS_j^2$ will equal $n^2(n - 2)/12$ and $S_C^2$
becomes $[4\Sigma C^2 - n^3(n - 1)^2]/4n$. Substituting into (2) we find

Case III
$$r_{nn} = \frac{n}{n - 1}\left[1 - \frac{n^3(n - 2)}{12\,\Sigma C^2 - 3n^3(n - 1)^2}\right]. \tag{5}$$


These and related formulae are easy to apply since they 
utilize simple constants and values of C which will already have 
been computed. Following are illustrative computational examples 
for the three cases, each case involving 6 judges. 


TABLE 1
Differentially Weighted Peer Ratings

                     Judge
Choice    1    2    3    4    5    6      C     C²
  1       X    0    1   -1   -1    1      0      0
  2      -1    X   -1    1    0   -2     -3      9
  3      -2   -1    X   -2   -2    0     -7     49
  4       2    1    0    X    1    2      6     36
  5       0   -2   -2    0    X   -1     -5     25
  6       1    2    2    2    2    X      9     81
                                       ΣC² = 200
TABLE 2
Equally Weighted Peer Ratings
(High Choices Only)

                     Judge
Choice    1    2    3    4    5    6      C     C²
  1       X    0    1    0    0    1      2      4
  2       0    X    0    1    0    0      1      1
  3       0    0    X    0    0    0      0      0
  4       1    1    0    X    1    1      4     16
  5       0    0    0    0    X    0      0      0
  6       1    1    1    1    1    X      5     25
                                        ΣC² = 46


In Table 1 each judge was required to make 2 high and 2 low
choices. The first and second (high) choices were weighted (W) 2
and 1 respectively, and the next to last and last (low) choices −1
and −2 respectively. Entering formula 3 with n = 6,
$\Sigma W^2 = (1^2 + 2^2)$ or 5, and $\Sigma C^2 = 200$, we have

Case I
$$r_{nn} = \frac{6}{6 - 1}\left[1 - \left(\frac{6}{6 - 1}\right)\frac{2(6)(5)}{200}\right] = .768$$

In Table 2 each judge was asked to make only 2 high choices, each
of which was given a weight (W) of 1. Entering formula 4 with
n = 6, $\Sigma W = (1 + 1)$ or 2, $\Sigma W^2 = (1^2 + 1^2)$ or 2, and
$\Sigma C^2 = 46$, we have

Case II
$$r_{nn} = \frac{6}{6 - 1}\left[1 - \left(\frac{6}{6 - 1}\right)^{2}\frac{(6 - 1)(2) - (2)^2}{46 - 6(2)^2}\right] = .729$$

TABLE 3
Complete Peer Rankings

                     Judge
Choice    1    2    3    4    5    6      C     C²
  1       X    3    4    2    2    4     15    225
  2       2    X    2    4    3    1     12    144
  3       1    2    X    1    1    3      8     64
  4       5    4    3    X    4    5     21    441
  5       3    1    1    3    X    2     10    100
  6       4    5    5    5    5    X     24    576
                                      ΣC² = 1550

In Table 3 each judge ranked the remaining five individuals.
Entering formula 5 with n = 6 and $\Sigma C^2 = 1550$, we have

Case III
$$r_{nn} = \frac{6}{6 - 1}\left[1 - \frac{(6)^3(6 - 2)}{12(1550) - 3(6)^3(6 - 1)^2}\right] = .768$$
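
The three computational formulae can be checked against the illustrative tables with a short Python sketch. The function and variable names are hypothetical, introduced here for illustration. The Table 1 weights are obtained by subtracting 3 from the Table 3 ranks, following the statement in the text that adding n/2 = 3 to the Table 1 scores yields the Table 3 rankings, and the Table 2 choices are the positive Table 1 weights scored 1.

```python
# Table 3: rows are the persons rated, columns are the judges;
# None marks the diagonal (a judge does not rate himself).
TABLE_3 = [
    [None, 3, 4, 2, 2, 4],
    [2, None, 2, 4, 3, 1],
    [1, 2, None, 1, 1, 3],
    [5, 4, 3, None, 4, 5],
    [3, 1, 1, 3, None, 2],
    [4, 5, 5, 5, 5, None],
]

# Table 1 (weights 2, 1, 0, -1, -2) and Table 2 (unit-weighted high choices).
TABLE_1 = [[None if v is None else v - 3 for v in row] for row in TABLE_3]
TABLE_2 = [[None if v is None else int(v > 0) for v in row] for row in TABLE_1]


def sum_c2(table):
    """Sum of squared combined scores C (row totals, skipping the diagonal)."""
    return sum(sum(v for v in row if v is not None) ** 2 for row in table)


def case_1(n, sum_w2, sc2):
    """Formula 3: equal numbers of high and low choices, symmetrical weights."""
    return n / (n - 1) * (1 - (n / (n - 1)) * (2 * n * sum_w2) / sc2)


def case_2(n, sum_w, sum_w2, sc2):
    """Formula 4: high (or low) choices only."""
    ratio = ((n - 1) * sum_w2 - sum_w ** 2) / (sc2 - n * sum_w ** 2)
    return n / (n - 1) * (1 - (n / (n - 1)) ** 2 * ratio)


def case_3(n, sc2):
    """Formula 5: complete rankings of the other n - 1 group members."""
    return n / (n - 1) * (1 - n ** 3 * (n - 2) / (12 * sc2 - 3 * n ** 3 * (n - 1) ** 2))


r_case_1 = case_1(6, 1 ** 2 + 2 ** 2, sum_c2(TABLE_1))
r_case_2 = case_2(6, 1 + 1, 1 ** 2 + 1 ** 2, sum_c2(TABLE_2))
r_case_3 = case_3(6, sum_c2(TABLE_3))
print(round(r_case_1, 3), round(r_case_2, 3), round(r_case_3, 3))
```

The sums of squares computed from the matrices are 200, 46, and 1550, matching the table footers.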


The specific ratings presented in the above tables illustrate two
points. First, differential rating schemes may be expected to yield
higher reliabilities than those employing unit weights, but only
when the raters are capable of reliably making the finer
discriminations required by the former. When they are unable to do
so, the more refined schemes may well yield lower reliabilities
than the more gross. Table 2 consists of the same high choices, unit
weighted, that appear in Table 1. It will be noted in Table 1
that there is reasonable agreement among judges not only in who
should be high-chosen and low-chosen but in the ordering of
individuals within these categories. Thus the resultant reliability
is higher than that which probably would have been obtained from
grosser discriminations such as those in Table 2. On the other
hand, if the judges had been unable to discriminate reliably
within the high-chosen and low-chosen groups, the relative values
of the reliabilities might well have been the reverse. (For example,
if for alternate judges the ordering of choices within categories
is reversed, the reliability for Case I will be found to be lower
than that for Case II.)

Second, when Case I data yield the equivalent of complete
rankings it would be expected that the resultant reliability would
be identical to that obtained with the Case III formula for
transmuted data. This occurs in the present instance. If the
constant n/2 or 3 is added to each score in Table 1 the rankings
found in Table 3 will result. The two sets of data yielded
identical results. However, the reader should be cautioned that
ordinarily Case I data will not form complete ranks and the two
approaches will not give equivalent results. In fact, where n is
moderately large, judges usually will not be able to make reliable
discriminations within the middle of the range and rankings will
tend to have lower reliability than ratings.


Discussion 
In classical test theory it is usual to classify the set of items
that comprises a test into either unstratified composites or
stratified composites. The first consists of items that are not
clustered in any known manner but are drawn as if by random sampling
from a similarly constituted domain of items. The second
contains subgroups of items that are drawn as if by stratified random
sampling from a similarly constituted domain (Tryon, 1957).

The unstratified domain model seems most applicable to the 
typical peer rating situation. It is reasonable to assume that in 
most instances members of natural or designed peer groups will 
be equivalent to one another in their characteristics as raters 
and further may be considered to represent a random sample 
from an unstratified domain of raters having the specified inter- 
personal exposure. The findings of Gordon and Medland (1965) 
that, for the domain under study, peer rater performance was 
relatively uninfluenced by group composition, support the
reasonableness of the above assumption. Formula 2 is appropriate for
estimating the reliability of peer ratings under this particular
model.

However, on occasion, groups will be found for which the
stratified domain model would be more appropriate. Such groups
would be characterized by a division into subgroups (or cliques),
with members of a given subgroup making high choices primarily
from among their own subgroup members, and (where called
for) low choices from among those of the other subgroup. Tryon
(1957) noted that for stratified composites, such as these,
reliabilities based on a single sample are indeterminate and that
formulae such as coefficient alpha are inapplicable. Similarly,
formula 2 could not be used in this situation. Reliability estimates
would require the use of a parallel test sample. The split-half
technique, followed by the Spearman-Brown correction, could
serve in this regard. While tests could be split on an a priori basis
so as to provide equal subgroup representation, peer groups
typically could not. Stratification, in the latter case, usually would
be discovered a posteriori and this type of splitting would be
precluded since it would capitalize on chance. Only a random
splitting, which would tend to underestimate reliabilities, could
be used.3

An interesting parallel is found between formula 2 and
Kuder-Richardson formula 20, a special case of coefficient alpha, in
that when the average intercorrelation among raters, in the
first instance, and among items, in the second, is negative, each
formula will yield a negative value. Kuder and Richardson (1937)
indicated that to treat such values as negative reliability is
inadmissible, since in its formulation reliability is “the
characteristic of a test possessed by virtue of the positive
intercorrelations of the items composing it.” (Negative reliability is
similarly denied in formulae that utilize variance ratios.)
Accordingly, modest negative values are traditionally treated as
random deviations from zero. On the other hand, systematic complexity
within the peer rating group, such as subgroup stratification, will
result in large negative values (sometimes in excess of −1.00) when
formula 2 is used. In such circumstances, formula 2 will reveal the
failure of the data to fit the unstratified domain model and thus its
own inapplicability.

3 Tests of subtest consistency, such as modified coefficient alpha
(Cronbach, 1951), would be similarly precluded with a posteriori
identification of clusters.

There is an inherent problem in estimating the reliability of
peer ratings in that each judge is concerned with a slightly
different set of subjects, since he may not rate himself while
the remaining judges are required to rate him. This problem
exists in making split-half estimates as well as those presented
here. For example, reliabilities of 1.00 are unattainable for peer
ratings4 as conventionally obtained and scored, since any individual
placed in a high (or low) group by any rater cannot be so
placed by himself. Fortunately, this limitation is not serious in
the usual peer rating situation. The maximum reliability attainable
is largely a function of group size, and reliabilities of .99 and .95
respectively may be obtained for groups of 16 and 9 members
with conventional rating schemes. The maximum reliability for
any rating or scoring scheme and any sample size is readily
calculated by introducing the maximum value of $\Sigma C^2$ into the
appropriate formula5 and solving for $r_{nn}$.

4 Coefficient alpha has this same limitation when item difficulties
are unequal.

5 For Case II this value of $\Sigma C^2$ will equal
$\sum_{i=1}^{k}[(n - i)W_i + (i - 1)W_{i-1}]^2 + (kW_k)^2$, where k is
the number of choices made by each judge and where these choices are
weighted $W_1, W_2, \ldots, W_k$ in descending order of magnitude; it
equals $kW^2[(n - 1)^2 + k]$ where the choices are equally weighted.
For Case I, corresponding values of $\Sigma C^2$ would be doubled. For
Case III, the maximum value of $\Sigma C^2$ is given by
$\sum_{i=1}^{n}(in - 2i + 1)^2$.
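
The ceiling on reliability for complete rankings (Case III) can be illustrated with a small Python sketch. It is a sketch under stated assumptions: the function names are hypothetical, and "perfect agreement" is modeled here as every judge ranking the other group members in one common order, which gives person i a combined score of in − 2i + 1.

```python
def perfect_agreement_scores(n):
    """Combined scores C when all n judges rank the others in one common order."""
    scores = []
    for i in range(1, n + 1):          # person i holds position i in the common order
        total = 0
        for j in range(1, n + 1):      # judge j ranks everyone except himself
            if j == i:
                continue
            # Removing judge j from the order shifts person i down one rank
            # only when judge j stands before him.
            total += i - 1 if j < i else i
        scores.append(total)
    return scores


def case_3(n, sum_c2):
    """Formula 5 for complete rankings."""
    return n / (n - 1) * (1 - n ** 3 * (n - 2) / (12 * sum_c2 - 3 * n ** 3 * (n - 1) ** 2))


def max_rank_reliability(n):
    """Largest reliability formula 5 can yield for a group of n members."""
    scores = perfect_agreement_scores(n)
    return case_3(n, sum(c * c for c in scores))


print(perfect_agreement_scores(6))          # [5, 9, 13, 17, 21, 25]
print(round(max_rank_reliability(6), 3))    # 0.891
print(round(max_rank_reliability(16), 2))   # 0.99
```

For six raters the ceiling agrees with the maximum of .891 quoted later for the Table 1 data, and for sixteen raters it agrees with the .99 cited in the text.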
Peer rating reliabilities will be obtained in some instances to
determine the extent to which a given characteristic can be
evaluated by individuals having some defined interpersonal
exposure, and in others, to compare several different characteristics
in this regard. Frequently, only natural groups of relatively
small and/or varying size will be available to the investigator.
Group size, while generally not a variable of interest, will
determine the maximum attainable reliability and will affect the
magnitude of the obtained reliability. Accordingly, some control over
this influential but extraneous variable would be important.

One approach to this problem would involve the adoption of
a “standard group” of, say, 16 raters as a uniform frame of
reference. For this size group, reliabilities of near unity (.99)
are attainable and interpersonal exposure similar to that possible
in smaller and somewhat larger groups could reasonably be
assumed to occur. The obtained reliability for the given group
could then be used to estimate the extent to which the “standard
group” of raters, representing a random sample from the same
unstratified domain, would agree in their evaluations of one
another on the particular characteristic. Specifically, the reliability
of the “standard group” would be predicted by entering the
obtained reliability together with the multiplier required to raise
(or lower) the given group size to 16, that is 16/n, into the
Spearman-Brown prophecy formula.6

The above procedure would be particularly useful for
interpreting obtained reliabilities where the maximum attainable
reliabilities are unduly restricted by group size. By way of example,
for the data in Table 1 the obtained reliability based on 6 raters
was found to be .768 (a maximum value of .891 was possible).
When n = 6 and $r_{nn}$ = .768 are entered into the simplified
Spearman-Brown formula,6 a reliability of .898 is estimated for
a group of 16 raters. This outcome suggests that the characteristic,
as defined, would be ratable at a respectable level of reliability by
“standard groups” from this particular domain of raters.
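
The projection to a standard group can be sketched in Python as follows. The function name and the `standard_n` parameter are hypothetical; the computation is the ordinary Spearman-Brown formula with the multiplier 16/n described above.

```python
def standard_group_reliability(r_nn, n, standard_n=16):
    """Spearman-Brown projection of a reliability from n raters to standard_n raters."""
    k = standard_n / n                      # lengthening multiplier, e.g. 16/n
    return k * r_nn / (1 + (k - 1) * r_nn)


# Table 1 example: 6 raters with an obtained reliability of .768.
estimate = standard_group_reliability(0.768, 6)
print(round(estimate, 3))   # 0.898
```

The same function lowers the estimate when the obtained group is larger than the standard, since the multiplier is then less than one.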

Where peer rating scores are to be considered for decision
making purposes, the above procedure cannot be legitimately
applied. Here the reliability coefficient will serve to indicate the
extent to which the present group members do agree. The
inability of the group to manifest a high level of agreement, due
to the constraints of sample size, normally would argue against
the operational use of their ratings. However, the use of peer
rating scores by the decision maker would be defensible where
compelling validity data had been obtained previously with
similar groups having equivalent or lower reliability.

6 In this application, the Spearman-Brown prophecy formula will
simplify to $16r_{nn}/[n + (16 - n)r_{nn}]$. This formula is applicable
when n is larger or smaller than 16.

Formula 2 and derivatives thereof should prove useful for
obtaining estimates of the reliability of peer ratings. These formulae
provide more stable estimates than does the split-half procedure
and also involve much less computational effort. In the occasional
case where the rating group is stratified, the inappropriateness of
formula 2 will be revealed by a high negative value. The random
split-half procedure could then be used.
Rank totals that accrue to each object from the paired-comparison
frequency matrix can be linearly transformed to yield
scale scores for the objects. Tests of significance may be applied
to the rank sum differences depending upon the objectives of the
scaling process.

Where the Bradley-Terry (1952) and Thurstone (1927) procedures
yield scales based on a normalizing transformation, the
rank scale is a variance-stable scale, as is, e.g., the arcsine
transformation of percentages. This is immediately apparent from the
fact that any given rank-sum difference has the same significance
no matter where the rank totals may be located on the scale. Thus
the variances for all scale values are equal, a feature that other
methods assume but do not guarantee. Moreover, with a given
number of judges and items, scale variances are independent of
the nature of the items being scaled. Therefore the technique
lends itself to scaling a wide variety of stimuli.
Application of multiple-comparison tests of significance would appear
to be an important adjunct to the scaling process for the following
reasons:

1. The tests can provide a basis for making decisions about
whether to consider two items as coming from the same population
of stimuli. This permits use of the technique in building
parallel forms of scaling instruments. If two objects, for example,
have a difference in rank sums of 7, and this difference could
occur with a P of .999, then the difference between the objects is
very small, probably a chance difference. When pairs of such
objects are assigned to each of two forms of the same instrument,
the forms may reasonably be assumed to be parallel.

2. Significance tests can be used to determine categories of 
stimuli which can be considered discrete. In certain scaling tasks 
it might be desired to assign individuals to categories which have 
been nominated as significantly different. If, for example, the 
rank difference between two objects is so large that the difference 
would occur by chance only one per cent of the time, the objects 
may be regarded as discrete. 

3. Significance tests can be used to build an index of scalability
(SI) for psychological objects, analogous to the coefficient of
reproducibility for cumulative scales. Anderson and Bashaw (1967)
derive a scalability index from the ratio of pairs of objects which
are significantly different from each other to the total number
of pairings of the objects. If, for example, seven items are
scaled and of the 21 possible pairs 14 are significantly different,
the SI is 14/21 or .67.

4. Significance tests provide a way to ascertain the proper sample
size for instrument development. After a level of significance is
chosen, the sample size necessary to insure that all objects have
the opportunity of being significantly different can be easily
calculated. If, for example, one wishes to provide the opportunity
for seven psychological objects to differ significantly at the .10
level, a sample size of at least 68 is required.


Review of Literature 


Mosteller (1958) suggested utilizing the paired-comparison
frequency matrix to obtain rank sums as an initial step in
psychological scaling. Guilford (1954) and Rummel (1964) also
proposed similar techniques for scaling ranked or paired-comparison
data.

Dunn-Rankin and Wilcoxon (1966) investigated the true distribution
of the differences between ranks in the two-way classification,
i.e., a judges by items analysis of variance. They derived
true probability values for rank differences where the number of
judges was less than or equal to 15. They also showed that for a
greater number of judges the normal approximation to the true
distribution gave accurate probabilities.

Any paired-comparison frequency matrix where circularity is
minimal can be considered a two-way analysis of variance by
ranks. In making multiple comparisons between objects, the
distribution of maximum differences between rank totals, i.e., the
distribution of the range of rank totals, can be used to make
tests of significance between the objects for a specified number
of judges (Wilcoxon, 1964).


A Simplified Rank Method of Scaling 


The scaling method presented here uses the judgment or the
response approaches (Torgerson, 1958) where the stimuli are
assigned scale values. In the behavioral sciences the stimulus-centered
scale usually deals with a central concept of interest, e.g., United
States participation in the war in Viet Nam, career preferences
of high-school students, or value judgments of first-graders. The
items are directly related to the concept under consideration.

The assumption is made that the reaction of a person toward
a unidimensional concept distributes the stimuli related to the
concept on a unidimensional continuum in such a way that each
item may be compared with any other in terms of the subject's
preference, i.e., the stimuli may be ordered.

In the derivation of a scale from an ordering of the items, the
relative spacing among the stimuli is involved. Such spacing is
deduced from confusions among the stimuli. If stimulus B is more
often confused with A than is C, then B is closer than C to A.
The relative spacing can be deduced from the rank values accruing
to each item as a result of inter-item comparisons by a number
of judges.

The proposed method of scaling, referred to as the Simplified 
Rank Method, involves the following steps: 

1. A sufficient number of judges are selected to insure that
each item has the opportunity of being demonstrated to be
significantly different from every other item. This is accomplished
for any number of items (I) and for any alpha level by choosing
the appropriate value of $Q_\alpha$ from Table 1 and solving for J in
the following formula:

$$J = Q_\alpha^{2}\,(I)(I + 1)/12$$
TABLE 1
Harter's Q_α = W/S Values for 3 to 15 Items where α = .01, .05, .10

Items      .01      .05      .10
  3       4.120    3.314    2.902
  4       4.403    3.633    3.240
  5       4.603    3.858    3.478
  6       4.757    4.030    3.661
  7       4.882    4.170    3.808
  8       4.987    4.286    3.931
  9       5.078    4.387    4.037
 10       5.157    4.474    4.129
 11       5.227    4.552    4.211
 12       5.290    4.622    4.285
 13       5.348    4.685    4.351
 14       5.400    4.743    4.412
 15       5.448    4.796    4.468

df = ∞


Table 2 presents the number of judges (J) necessary at the
.01, .05, and .10 probability levels for I = 3(1)7. For the I and
alpha levels not listed, a solution can be obtained by solving for
J in the formula above, where Q is taken from Dixon and
Massey (1957) or from the complete set of values presented by
Harter (1959b).
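
The calculation of the required number of judges can be sketched in Python. The dictionary and function names are hypothetical, and rounding the result up to the next whole judge is an assumption made here (it is consistent with the "at least 68" quoted earlier for seven items at the .10 level). Only the first five rows of Table 1 are copied in.

```python
import math

# Harter's Q values from Table 1, keyed by number of items: (.01, .05, .10).
Q_VALUES = {
    3: (4.120, 3.314, 2.902),
    4: (4.403, 3.633, 3.240),
    5: (4.603, 3.858, 3.478),
    6: (4.757, 4.030, 3.661),
    7: (4.882, 4.170, 3.808),
}
ALPHA_COLUMN = {0.01: 0, 0.05: 1, 0.10: 2}


def judges_needed(items, alpha):
    """Smallest whole J satisfying J >= Q^2 * I * (I + 1) / 12."""
    q = Q_VALUES[items][ALPHA_COLUMN[alpha]]
    return math.ceil(q ** 2 * items * (items + 1) / 12)


for items in (4, 5, 6):
    print(items, [judges_needed(items, a) for a in (0.01, 0.05, 0.10)])
print(7, judges_needed(7, 0.10))
```

For four, five, and six items the sketch returns the same counts that are legible in Table 2.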

2. The stimuli are ordered. If the items are ranked in order
of preference by the judges, then values I, I − 1, ..., 2, 1 are
assigned to the I items. The values assigned reflect the magnitudes of
the preference.

If the ranking is derived from paired comparisons, an item-by-item
frequency preference matrix for each judge is established,
in which the $a_{ij}$ cell contains a “1” if item j is preferred
to item i, otherwise a “0,” i standing for the rows and j for the
columns. Zeros are placed on the diagonal. For a single
individual, the sums of the column entries yield a rank ordering of
items.

TABLE 2
Number of Judges Necessary to Insure the Possibility of I Items Being
Significantly Different

            Number of Judges (J)
Items      .01      .05      .10
  3         17       11        9
  4         33       22       18
  5         53       38       31
  6         80       57       47
Paired-comparison choices can also be counted and a rank ordering
established from the number of votes accruing to each of
the items. For example:

[Example table: Observer 1*, paired-comparison choices and votes]

* In each pair the underlined item was favored.

This is the same as would have occurred if an observer had
assigned rank values of I, I − 1, I − 2, ..., 2, 1 to the I items and
if he had been consistent in making his paired comparisons. From
the column totals of the example above, the ordering is 4, 3, 2, 1.†

† If 1 is added to the column totals, a rank ordering results.

If inconsistencies are present, i.e., circular triads, then the column
sums will contain duplicates and not all order values from 0 to
I − 1 will be present (Coombs, 1964). For example:

[Example table: Observer 2*]


* In each pair the underlined item was favored. 


3. Rank totals for each item are obtained by summing the
rank orders for all judges. For paired-comparisons data, it is
sometimes simpler to establish a frequency matrix for all the judges
and add the value of J to the column totals. The sum of the
rank totals must equal

$$\sum R_i = J \cdot I(I + 1)/2.$$


4. The minimum and maximum possible rank totals are determined
from the following formulas:

$$R_{\min} = J$$
$$R_{\max} = I(J).$$


5. An initial scale is formed by a linear transformation of the
rank totals to a scale of whole numbers or mixed decimals from
0.0 to 100.0. The value of $R_{\min}$ is subtracted from each
rank total, each difference is divided by the range, and the quotient
is multiplied by 100:

$$SV = \frac{R_i - R_{\min}}{R_{\max} - R_{\min}} \cdot 100.$$


6. For values of I = 3(1)15 and J = 3(1)500, critical ranges necessary
to consider any two items significantly different are obtained by
consulting the critical range table as given by Dunn-Rankin (1965)
or Dunn-Rankin and Wilcoxon (1966). Critical ranges can also be
calculated for more than 15 judges by multiplying the expected
standard deviation for rank scores ($S = \sqrt{J \cdot I(I + 1)/12}$) by values
for $Q_\alpha$ found in Table 1, as found in Dixon and Massey (1957), or as
given by Harter (1959). The $Q_\alpha$ values are taken for I and infinite
degrees of freedom.

7. The items of the primary scale may be systematically rescaled
using tests of significance based on the range of the
original rank totals. Items not significantly different are grouped
together. The average score value of those items not significantly
different will usually be chosen as the final scale score for those
items. However, if some external criterion is available, another
item score or the average score for the group of items may be
chosen. The choice of a significance level will be largely dependent
upon the scaling objectives, i.e., a low probability would
increase scalability and a high probability would facilitate selection
of parallel items.

8. A Scalability Index (SI) is derived by finding the ratio of
the number of significantly different pairs of objects to the total
possible number of pairs. An SI of one (1.0) indicates perfect
scalability, whereas an SI of zero indicates no significantly differ- 
ent pairs. 
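
The first and fifth steps above lend themselves to a short computational sketch. The Python fragment below is illustrative only and is not part of the published method: the function names are hypothetical, the Q values are copied from Table 1 for I = 3 to 7, and rounding J up to the next whole judge is an assumption (it is the rounding that agrees with the entries of Table 2).

```python
import math

# Harter's Q values for df = infinity, copied from Table 1 (I = 3..7 only).
Q_TABLE = {
    3: {0.01: 4.120, 0.05: 3.314, 0.10: 2.902},
    4: {0.01: 4.403, 0.05: 3.633, 0.10: 3.240},
    5: {0.01: 4.603, 0.05: 3.858, 0.10: 3.478},
    6: {0.01: 4.757, 0.05: 4.030, 0.10: 3.661},
    7: {0.01: 4.882, 0.05: 4.170, 0.10: 3.808},
}


def judges_needed(n_items, alpha):
    """Step 1: J = Q^2 * I * (I + 1) / 12, rounded up (assumed rounding)."""
    q = Q_TABLE[n_items][alpha]
    return math.ceil(q ** 2 * n_items * (n_items + 1) / 12)


def scale_values(rank_totals, n_judges):
    """Step 5: linear transformation of rank totals to a 0-100 scale."""
    n_items = len(rank_totals)
    r_min = n_judges              # every judge assigns the lowest rank, 1
    r_max = n_items * n_judges    # every judge assigns the highest rank, I
    return [100 * (r - r_min) / (r_max - r_min) for r in rank_totals]


if __name__ == "__main__":
    for n_items in (3, 4, 5, 6):
        row = [judges_needed(n_items, a) for a in (0.01, 0.05, 0.10)]
        print(n_items, row)
```

The rank totals passed to `scale_values` must already include the value of J when they come from paired-comparison column totals, as described in step 3.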
Discussion 

If the number of items is small (i.e., < 5) it is probable that the
items can be ranked accurately without resorting to paired comparisons.
The inference is that mental comparisons of each item
with all other items have been made in order to achieve the
rankings. Witryol (1960) used the paired-comparison technique
with only five items because he was working with young children
as judges. If the items are unfamiliar or subtle in nature, it may
be desirable to make paired comparisons on as few as three items.

When a scale is constructed from a set of items, it is usually
expected that certain items will not be significantly different from
others in the set. In that case a smaller number of judges can be
utilized than is called for in Table 1. When this is done, however,
the number of items that are significantly different, and therefore
scalable, must be reduced.
The scale scores which correspond to $R_{\min}$ and $R_{\max}$ define the
extreme values for the item scores on the final scale. If X is the position
of the $R_{\min}$ scale value and Y the position of the $R_{\max}$ scale value
on a line representing the unidimensional concept, then the scale
values for the items in question must be located along this line.

X     A     B  C     D     F     Y

If all the items are scaled between X and Y, such items do not
represent the extreme attitude of the population. If, however, an
item scales at the minimum or maximum, the stimulus value of
that item cannot be determined without increasing the number of
judges or using more items.
One of the objections that Mosteller (1958) had to a scaling
method based on rank totals was the inability to recapture the
frequencies from the scale scores. Frequencies can be recaptured
from the rank scale score by means of the following formula:

$$F'_{ij} = \frac{R_i - R_j}{I} + \frac{J}{2},$$

where $F'_{ij}$ is the recaptured frequency for cell i, j, I is the number of
items or objects, J is the number of judges, and $R_i$ and $R_j$ represent
any two rank totals. An example is given below, where three objects
(I = 3) have been compared by twenty subjects (J = 20) in a fictitious
experiment. The scale values (SV) are derived by

$$SV = \frac{R_i - R_{\min}}{R_{\max} - R_{\min}} \cdot 100,$$

where $R_{\max}$ is (I − 1)(J) and $R_{\min}$ equals zero. In this case the
number of judges, J, is not added to the totals in the frequency
matrix in order to simplify the computation. The value of J is added
only to make the analysis consistent with an analysis done on rank-ordered
data and has no effect on the scale values.

F Matrix
         R_max      A       B       C     R_min
A                   -       6       2
B                  14       -       9
C                  18      11       -
Total      40      32      17      11       0
SV        100      80     42.5    27.5      0

$$F'_{12} = .4(80 - 42.5)/3 + 10 = 15$$
$$F'_{13} = .4(80 - 27.5)/3 + 10 = 17$$
$$F'_{23} = .4(42.5 - 27.5)/3 + 10 = 12$$

F' Matrix
          A       B       C
A         -       5       3
B        15       -       8
C        17      12       -


It would appear that a chi-square test comparing the off-diagonal 
cell entries of the F and F' matrices could constitute a test of how
well the frequencies could be recaptured. 
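
The recapture of frequencies in the fictitious example can be checked with a few lines of code. The sketch below is not the author's program; it assumes the recapture rule F'(i, j) = (R_i − R_j)/I + J/2 applied to the column totals of the F matrix (J not added), which is the form that reproduces the F' matrix printed above. The function name and the (row, column) keying are illustrative choices.

```python
def recapture_frequencies(rank_totals, n_judges):
    """Estimate paired-comparison frequencies from rank (column) totals.

    rank_totals maps each item label to its column total in the F matrix.
    The returned dict is keyed by (row, column); the value estimates how
    many judges preferred the column item to the row item.
    """
    n_items = len(rank_totals)
    recaptured = {}
    for col, r_col in rank_totals.items():
        for row, r_row in rank_totals.items():
            if row == col:
                continue  # the diagonal is left empty
            recaptured[(row, col)] = (r_col - r_row) / n_items + n_judges / 2
    return recaptured


if __name__ == "__main__":
    totals = {"A": 32, "B": 17, "C": 11}   # column totals of the F matrix
    for cell, freq in sorted(recapture_frequencies(totals, 20).items()):
        print(cell, freq)
```

Each pair of complementary cells sums to J, and each recaptured column sums to the original column total, so the recaptured matrix preserves the rank totals from which it was built.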
Comparison with Other Techniques


A comparison between the simplified rank scaling method, based 
on the rank totals derived from an item-by-item matrix, and 
Thurstone’s Case V model involves Hill’s (1953) data, as reported 
by Edwards (1957). A frequency preference matrix, derived from 
paired-comparisons data for 94 judges and seven statements con- 
cerned with United States entry in the Korean War, is given in 
Table 3. 


TABLE 3
Frequency Preference Matrix for Seven Statements about
U. S. Participation in the Korean War
($a_{ij}$ = preference of $a_i$ > $a_j$, where j = row and i = column)

Items       1      2      3      4      5      6      7
1           -     65     75     80     75     86     88
2          29      -     51     54     62     68     81
3          19     43      -     49     59     60     63
4          14     40     45      -     49     63     67
5          19     32     35     45      -     51     55
6           8     26     34     31     43      -     57
7           6     13     31     27     39     37      -

Totals     95    219    271    286    327    365    411


The step-wise results of the rank method of scaling are illustrated
below.

1. Rank sums are derived from the total frequency preference
matrix by addition of 94 (the number of judges) to the column
totals.

Totals            95    219    271    286    327    365    411
No. of Judges    +94    +94    +94    +94    +94    +94    +94
R_i              189    313    365    380    421    459    505

$\sum R_i = 2632$


2. The sum of the rank totals and the minimum and maximum
ranks are computed.

$\sum R_i = 2632$

$R_{\min} = 94$

$R_{\max} = 658$

3. Scale scores are found for each $R_i$, including $R_{\min}$ and $R_{\max}$.

Items     R_i     R_i − R_min     (R_i − R_min)/range × 100
Min        94          0                   0
1         189         95                  17
2         313        219                  39
3         365        271                  48
4         380        286                  52
5         421        327                  58
6         459        365                  65
7         505        411                  73
Max       658        564                 100


4. Subtracting the value of Rmin from each rank sum, dividing 
each resulting value by the range, and multiplying the quotients by 
100 yields the initial scale. 


Min    1    2    3    4    5    6    7    Max
0     17   39   48   52   58   65   73    100


A decision could be made to select those items which are most 
widely and evenly separated for use in scaling attitudes. There 
being 94 judges, the opportunity for all the items to be significantly
different at the .05 level is provided for. The critical
range table (Dunn-Rankin, 1965) shows that a critical range of
87 is necessary for considering any two items significantly
different at the .05 level. Alternatively, the critical range can be
computed by

$$\text{Critical Range} = S \cdot Q_\alpha$$
$$CR = (20.94)(4.170)$$
$$CR = 87.$$


Comparisons between objects can be presented in two forms
(see Table 4).

Any items not underlined by the same line are significantly
different at the .05 level. A scalability index at the .05 level is
formed by

$$SI = \frac{\text{No. of significantly different pairs}}{I(I - 1)/2}.$$

TABLE 4
Critical Ranges Found in the Scaling of Seven Statements by 94 Judges

Items      7       6       5       4       3       2       1
6         46
5         84      38
4        125*     79      41
3        140*     94*     56      15
2        192*    146*    108*     67      52
1        316*    270*    232*    191*    176*    124*
* Significant at .05 level.



The results in this case indicate that only three items can be 
considered significantly different. In a final scale illustrated be- 
low, the rescaling was accomplished by maximizing the distance 
between three significantly different items. 


Min 1 3 7 Max 
0 17 48 73 100 
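
The critical range and the scalability index for the Korean War statements can be recomputed directly from the rank totals. The following Python sketch is offered only as a check on the arithmetic, not as the author's procedure. It assumes the rank totals 189, 313, 365, 380, 421, 459, and 505 (column totals plus 94), the tabled Q of 4.170 for seven items at the .05 level, and that a difference counts as significant when it equals or exceeds the critical range; the function names are hypothetical.

```python
import math
from itertools import combinations

# Rank totals for the seven statements (column totals of Table 3 plus J = 94).
RANK_TOTALS = {1: 189, 2: 313, 3: 365, 4: 380, 5: 421, 6: 459, 7: 505}


def critical_range(n_items, n_judges, q):
    """CR = S * Q, with S = sqrt(J * I * (I + 1) / 12)."""
    return q * math.sqrt(n_judges * n_items * (n_items + 1) / 12)


def scalability_index(rank_totals, cr):
    """Return the significantly different pairs and their share of all pairs."""
    pairs = list(combinations(rank_totals, 2))
    significant = [
        (a, b) for a, b in pairs
        if abs(rank_totals[a] - rank_totals[b]) >= cr
    ]
    return significant, len(significant) / len(pairs)


if __name__ == "__main__":
    cr = critical_range(n_items=7, n_judges=94, q=4.170)
    pairs, si = scalability_index(RANK_TOTALS, cr)
    print(f"critical range = {cr:.1f}")
    print(f"significant pairs = {len(pairs)} of 21, SI = {si:.2f}")
```

The index printed by this sketch is computed here from the rank totals and is not a value quoted from the article.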


If some external criterion for judgment were available, then
from those items which are not significantly different another
might have been chosen which seemed to fit the situation better.
For example, item six could have been used in the scale above
instead of item seven. An experimenter who wished to use the
most centrally located of a group of items that are not significantly
different would have chosen items 1, 3, and 6.

If standard scores are derived from the rank totals using $(R_i - \bar{R})/s$
and then divided by the largest possible score, and if the same procedure
is carried out for the scales derived from the Thurstone
technique, a comparison between the two methods can be made (Bliss,
1960). The largest absolute difference between the Thurstone scales
and the rank method is .051 (see Table 5).

TABLE 5
Scale Comparisons of Korean War Statements

Statement         1        2        3        4        5        6        7
Rank           −1.448    −.487    −.086     .109     .349     .643    1.000
Thurstone      −1.491    −.462    −.058     .058     .318     .634    1.000
Difference       .043     .025     .028     .051     .041     .009     .000


The Pearson correlation coefficient between the two scales is
.9992.

Another comparison of scaling methods is based on results of 
a taste-testing experiment (Hopkins, 1954). Hopkins used the 
Bradley-Terry Model to find expected frequencies with which a
particular sucrose solution was rated as sweeter than each of three 
others. Known concentrations of sucrose were added to tap water. 
Twenty independent replicate rankings of the sweetness of the 
solutions were made by six judges using the technique of paired 
comparisons. Graphical results of a comparison between the rank- 
ing technique, the Bradley-Terry method, and Thurstone’s Case 
V model are given in Figure 1, in which each scale has been 
transformed into a unit distance. Inter-correlations between the 
Rank, Thurstone, and Bradley-Terry scale values and the sucrose 
solutions were all > .9993. 


[Figure 1. Comparison between the Thurstone, Rank, and Bradley-Terry
Scales and the Sucrose Solution Criterion. Panels: Rank Scale Values;
Sucrose Solution Scaled by % Concentration; Thurstone Scale Values;
Bradley-Terry Scale Values.]

The linearity of the Rank scale values with the actual
sucrose concentration indicates that the Rank scales are an accurate
reflection of the judges’ ability to determine the amount
of sucrose in solution. Both the Bradley-Terry model and Thurstone’s
scales are not quite as accurate a fit to the original sucrose
concentrations as is the Rank scale. This is probably because of
their sensitivity to the extreme frequencies found in the preference
matrix. When the original Rank values are rescaled using the
critical range table, all the values are significantly different at
the … level and can be related to the extreme Rank totals.


Comparisons of the Simplified Rank Method with other techniques
indicate that most scaling methods yield similar results
(Dunn-Rankin, 1965; Bliss, 1960; Jackson, 1957). The new
method appears to have particular value for the following reasons:

1. It is quickly and easily used. Of all the methods studied, the
Rank Method appears to be the simplest and easiest by which
to construct stimulus-centered or response scales.

2. It allows the items to be scaled along a continuum with
meaningful end points. The extreme ranks can serve as guides
for the scales that fall inside them, because the extreme
ranks represent the unlikely but possible agreement that could
occur.

3. It is insensitive to extreme frequencies. Sometimes a sample
proportion will be 1 or 0, for which the normal deviates
are plus or minus infinity. Compensating for such problems is
unnecessary with the Rank Method.

4. Its accuracy compares favorably with that of other techniques.

5. It allows for tests of significance to be made between the
items. It appears that such tests have been lacking in scale construction.
Scaling on the basis of tests of significance should
lend greater validity to the scale constructed.

6. The necessary sample size can be easily found. As a by-product
of the critical range tables and the normal range approximation,
the sample size necessary to assure that each item
may be demonstrated to be significantly different is provided.

SOME POTENTIAL MODERATOR VARIABLES
IN ATTITUDE RESEARCH¹


ALAN R. BASS and HJALMAR ROSEN 
Wayne State University 


One of the most perplexing problems in attitude theory and attitude
research has been the problem of the relationship between attitudes
and behavior. Traditionally, attitude has been characterized
as a multidimensional construct, having affective, cognitive, and
behavioral aspects. Rosenberg (1956) and Fishbein (1963), for example,
have utilized both affective and cognitive components in their
research, while Triandis (1964) has focused on the behavioral
aspect in his work on the behavioral differential.

Research evidence, however, has indicated relatively little if any 
relationship between the affective and cognitive components of atti- 
tudes, on the one hand, and behaviors or behavioral intentions, on 
the other hand (cf. Festinger, 1964; Fishbein, 1966). Studies of the
relationship between publicly expressed attitudes toward an object 
and behavior toward that object have indicated little relationship 
between these variables (e.g., LaPiere, 1934). 

There are a number of possible explanations for this general lack 
of relationship between attitudes and behavior. Katz (1960; Katz 
and Stotland, 1959) for example, has suggested that behavior will 
not be related to an attitude unless the behavior serves some func- 
tional value relevant to that attitude. Similarly, Peak (1955), Ros- 
enberg (1956) and others have suggested that the "perceived instru- 
mentality” of a particular behavior is the critical variable in 
determining whether or not it will be related to attitudes. In a recent 
paper, Fishbein (1966) has suggested two additional explanations
for the general lack of relationship between attitudes and behavior.
First, he suggests that the attitude measured often refers to an
inappropriate stimulus object, i.e., a stimulus object which is not a
relevant object with reference to specific behavior one wishes to
predict. Second, Fishbein suggests that since any behavior is
overdetermined psychologically, the particular behavior in which one is
interested may be either wholly or partially unrelated to the attitude
one chooses to measure.

¹ We wish to thank Miss Sally Corey for her assistance in administering the
questionnaires and facilitating the data analyses, and Dr. S. S. Komorita for
comments on the manuscript.

An examination of Fishbein's arguments suggests to the present 
writers that one critical variable which needs to be taken into ac- 
count in attempting to predict behavior from attitudes is the sali- 
ence or relevance of the stimulus object to the respondent. Thus, one 
reason why the stimulus object is often inappropriate or irrelevant 
to the attitude-behavior relationship may be that the stimulus ob- 
ject is a generalized class of people or a generalized issue which, in 
itself, may not be behaviorally salient to the individual. Thus, as 
Fishbein (1966) points out, attitudes toward “Orientals” as assessed 
in the LaPiere study, for example, may not be predictive of behavior 
to a particular Oriental individual in a specific situation. The concept
of “Oriental” may not be particularly salient, relevant, or
motivationally involving, as such, to an individual, although a par- 
ticular Oriental individual with whom one has or may have inter- 
personal contact may be quite salient to the attitude respondent. 
Other theorists (e.g., Katz, 1960; Katz and Stotland, 1959; Smith et
al., 1956) have also suggested the importance of the salience of an
attitude object to the individual as a significant variable in attitude
research.

Unfortunately, however, most attitude studies have not explicitly 
attempted to incorporate the notion of saliency into experimental 
research paradigms, nor have any systematic attempts been made to
assess the saliency of an attitude object for an individual. Since we
are assuming that attitude object saliency will be an important 
moderator variable in the prediction of behavior from attitudes, the 
purpose of the present study was to attempt to construct a scale for 
assessing saliency of attitude objects. 

A search of the attitude literature would suggest that such terms
as centrality, motivational involvement, and value valence seem to
reflect the concept (saliency) to which we are here referring (see,
e.g., Krech, Crutchfield, and Ballachey, 1962, and Secord and Backman,
1964). In addition, another set of related variables seems to be
characterized by such terms as complexity, multiplexity, and
controversiality of the attitude issue. Finally, a third class of variables
relevant to our construct of saliency seems to be such aspects of
attitudes as certainty, confidence, decisiveness, and intensity.

In addition, some unpublished studies of behavioral predictions
via attitude conducted by the junior author have suggested similar
kinds of variables as possible moderators of the relationship between
attitudes and behavior. Follow-up interviews with subjects
whose behavior was incorrectly predicted from attitude measures
(affect toward the issue) yielded four general types of responses:
(1) “I never really thought about this (issue) before the questionnaire;
it really didn’t concern me,” (2) “I’m really so unsure about
it I don’t know what I checked,” (3) “It’s confusing - there are so
many ways of looking at it,” (4) “It’s so controversial a topic I feel
one way now and (a) different (way) tomorrow.” In general, these
responses suggest that the attitude object is a nonsalient one for
these individuals, or that their attitudes toward the object lack
crystallization.
If we accept one of the most parsimonious conceptualizations of
the concept of attitude, viz., that attitude represents “the amount
of affect for or against a psychological object” (Thurstone, 1931, p.
261) as Fishbein has advocated, then variables such as those mentioned
above could be viewed not as additional attitude dimensions
but as moderator variables in attitude research. Thus, in the case
of behavioral prediction via attitudes, the relationship between
attitude and behavior might be moderated by such characterizations
concerning the focal attitude object or issue. In the case of
attitude change, these variables could possibly moderate the effects
of treatments on affect toward the attitude object.
Most of the concern with the concept of attitude significance or
salience to the individual seems to have appeared in applied rather
than basic attitude research (e.g., Jurgensen, 1947; Porter, 1962,
1963; Rosen, 1961). Doob (1948), for example, questioned the
meaning of opinion research under conditions where the respondents
were either unconcerned or uninformed about the issue probed but
nevertheless gave opinions. Certainty of response does seem to
have been studied per se, but this research has generally tended to
assume that certainty can be studied in terms of intensity of affective
response. Little attempt has been made, however, to develop
independent measures of such variables as intensity, saliency, etc.,
and to determine their relationship with one another as well as with
the affect dimension.²

The major purpose of the present research is to determine, first, 
whether or not such often proposed concepts as attitude saliency, 
certainty, and subjective confusion in terms of complexity and/or 
controversiality of issues can be measured adequately; and second, 
to determine whether or not these components are sufficiently independent
of traditional affect measures to provide additional meaningful
attitude parameters that may be useful in attitude change
and behavioral prediction research. Two separate but related stud- 
ies were conducted with these objectives in mind. 


Procedure: Study I 


On the basis of a thorough review of the attitude literature, defini- 
tions and discussion relevant to the major variables of concern to 
the present study were recorded. In turn, this material was reworded 
to make up simple statements reflecting these concepts and definitions.
From the resulting list of statements, 40 were selected, in the
estimate of two independent judges, that best characterized the key
variables and were simple, easy to comprehend, and minimally
ambiguous.

These items were incorporated into a Likert-type questionnaire 
format. Response categories were provided based upon a six-point 
scale from “strongly disagree” to “strongly agree.” No neutral category
was provided. In addition, a six-item evaluative semantic differential
measure was utilized to obtain a measure of affect toward
the issue used.³ The six evaluative scales were summed to obtain an
overall affect score, and these six scales were also scored for attitude 
intensity by scoring for extremeness of response without regard to
direction. 
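
The two scores taken from the semantic differential can be illustrated with a brief sketch. The coding of the scales is not given in the text, so the example below assumes each of the six bipolar scales is coded from −3 (unfavorable pole) to +3 (favorable pole); the function name is hypothetical. Affect is the signed sum, and intensity is the "folded" sum of absolute values, i.e., extremeness without regard to direction.

```python
def score_semantic_differential(ratings):
    """Return (affect, intensity) for one respondent.

    ratings: six integers, one per evaluative scale, assumed to be coded
    from -3 (e.g., bad) through 0 to +3 (e.g., good).
    """
    if len(ratings) != 6:
        raise ValueError("expected ratings on six evaluative scales")
    if any(r < -3 or r > 3 for r in ratings):
        raise ValueError("ratings must lie between -3 and +3")
    affect = sum(ratings)                    # signed sum: direction of feeling
    intensity = sum(abs(r) for r in ratings)  # folded sum: extremeness only
    return affect, intensity


if __name__ == "__main__":
    # A respondent with extreme but mixed ratings: neutral affect, high intensity.
    print(score_semantic_differential([3, -3, 2, -2, 1, -1]))
```

Under this coding a respondent can obtain a neutral affect score together with a high intensity score, which is why the two measures can load on different factors.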

In this first study, the stimulus issue used was: “Romney for
Republican presidential nominee in 1968.” The questionnaires were
administered to 117 college sophomores in general psychology
classes.


² See Weksel and Hennes, 1965, for a significant exception.
³ The scales used were good-bad, approve-disapprove, desirable-undesirable,
positive-negative, pleasant-unpleasant, and nice-awful.

Results: Study I


A complete 42 × 42 intercorrelation matrix was obtained for the
40 attitude items and the affect and intensity measures from the
semantic differential. This matrix was then factored using the principal
components method and varimax rotation. Five factors were
extracted, which accounted for about 52 per cent of the total variance
of the matrix.⁴ The five varimax factors obtained are described
below.

Factor I, accounting for 16.5 per cent of the total variance, is
most highly loaded on the following items: “I am in conflict about
how I feel” (.82); “I just cannot make up my mind” (.80); “I cannot
decide how I feel about it” (.78); “I am pretty certain how I feel
about it” (−.73); and “I have a definite opinion about it” (−.78).
These items clearly reflect the degree to which the individual feels
certain or is decisive about his attitude toward the stimulus object.
This factor might tentatively be labeled an attitude confidence factor.
Interestingly, attitude intensity as obtained from the folded semantic
differential scoring also loads substantially on this factor,
suggesting that extremeness of response on semantic differential
scales does reflect, to a considerable extent, the degree to which the
subject feels certain about or has a definite opinion about the attitude
object (cf. Weksel and Hennes, 1965).

Factor II, accounting for 14.3 per cent of the total variance, loads
on items reflecting motivational involvement or concern about the
attitude object [e.g., “I am really quite concerned about it” (.70);
“I spend a good deal of time thinking about it” (.69); “I am very
involved in this” (.69)]. Neither the affect nor intensity components
of the semantic differential scales were loaded on this factor,
suggesting that the folded semantic differential score does not reflect
intensity of feeling about the attitude issue but rather confidence
about one’s feelings toward the issue (Factor I). This factor might
tentatively be labeled a motivational involvement factor, corresponding
to one of the functional components of attitude suggested
by Katz (1960).

⁴ The complete varimax factor matrix may be obtained by writing the
Department of Psychology, Wayne State University, Detroit, Michigan.

Factor III, accounting for 10.8 per cent of the variance, might be
labeled as an apathy factor in terms of its item content. It loads
most highly on the following items: “I have no feelings about it”
(.76); “It is unimportant to me” (.70); “I have no interest in it”
(.66); and “It does not concern me” (.58). It is interesting that this
emerged as a separate factor from Factor II, since both seem to be 
relevant to motivational involvement. However, Factor II seems to 
be more of a cognitive centrality or saliency factor, reflecting pri- 
marily how concerned the person is about the issue, while Factor III 
is more of an emotional or affective commitment factor, being pri- 
marily concerned with the strength of the individual’s feeling about 
the issue. Thus, a person might be quite concerned about and spend 
a good deal of time thinking about the Vietnam war, for example, 
and yet he may not have strong feelings about it or it may not have 
a strong affective valence for him. Basically, Factor III might be 
considered to represent an “affective involvement" aspect, while 
Factor II seems to be more of a “cognitive involvement” aspect of 
an attitude. 

Factor IV is clearly concerned with the degree to which the issue 
is a complex or controversial one to the subject. Items loading on 
this factor are: “I think the issue is highly controversial” (.78); “I
believe others considered it to be controversial” (.76); and “I think
it is a complex issue” (.56). This factor, accounting for 5.6 per cent
of the total variance, might be labeled a complexity-controversiality
factor.

Factor V accounted for relatively little variance (5.2 per cent) 
and was primarily specific to attitude affect as measured by the se- 
mantic differential. As such, it was not germane to the present re- 
search concern. However, it should be noted that the semantic dif- 
ferential evaluative affect measure did not load on any of the other 
four factors obtained here. Thus, it seems clear that for the issue in- 
volved here, we are essentially measuring aspects of an attitude 
which are independent of traditional affect measures. 


Study II 


Given the results of Study I, described above, Study II had two
major objectives. First, it was desired to determine whether the factors
obtained previously would hold up across issues. Second, this
study undertook to investigate whether factor scores based on these
factors would differentiate among issues in such a manner as to provide
useful and meaningful descriptive parameters for attitude studies.

Procedure: Study II

A new format of the questionnaire was constructed for this
study, utilizing only those items which loaded most highly on each
of the first four varimax factors in Study I. This resulted in an 18
item questionnaire (five items from each of Factors I and II, and
four items each from Factors III and IV). Again, this questionnaire
was administered preceding the six-item semantic differential
scale of the first study. The items used in this study are presented in
Table 1.

For Study II three issues were selected as stimulus items, which,
a priori, were assumed to differ on motivational involvement and
possibly on the other factors as well. “My academic future” was
used as one issue, assuming that this would be a concept of maximal
personal involvement. “The quarter system,” a controversial student
issue, was included as an important but probably less personally
involving issue. Finally, “the farm subsidy program” was selected
as an issue of low a priori involvement for a student
population.

TABLE 1
List of Variables Used in Study II

A. The following items were administered for each issue, using a Likert-type
   response format.
   1. I have a definite opinion about it.
   2. I feel that it has significant implications for me.
   3. I believe that others consider it to be controversial.
   4. I have no feelings about it.
   5. I am pretty certain how I feel about it.
   6. I am really quite concerned about it.
   7. There are many sides to this question.
   8. It is unimportant to me.
   9. I cannot decide how I feel about it.
  10. I spend a good deal of time thinking about it.
  11. I think this is highly controversial.
  12. Nothing can be done about it.
  13. I am in conflict about how I feel.
  14. I feel very strongly about this.
  15. I think it is a complex question.
  16. It does not concern me.
  17. I just cannot make up my mind.
  18. I am very involved in this.
B. The following scores were obtained for each issue using the semantic-differential
   format.
  19. Affect.
  20. Intensity.

The experimental sample was made up of 170 college sophomores.
Each subject was asked to respond separately to each issue in terms 
of the 18 item questionnaire and the semantic differential. Three 
separate 20 X 20 correlation matrices were generated (18 items plus 
semantic differential affect and intensity scores) and each was fac- 
tor analyzed yielding four varimax factors in each case. 


Results: Study II 


The varimax loadings for the 18 items on all three issues (along 
with the loadings for these items in the “Romney” analysis of Study 
I) are presented in Table 2. In general, it can be seen that the factor 
structures for the three issues in Study II are fairly similar across 
issues, and are also similar to the factor structure obtained for the 
“Romney” issue in Study I. In order to assess more precisely the 
extent of factorial similarity across issues, coefficients of congru- 
ence (Tucker, 1951) were obtained between the four factors across 
each of the four different attitude issues. These coefficients are pre- 
sented in Table 3. Here, it can be seen that the factor solutions for 
the four issues corresponded quite closely. The coefficients for cor- 
responding factors (e.g., Factor I in all four issues) were uniformly 
higher than any of the other coefficients for a given factor. How- 
ever, the factors for the “farm subsidy” issue do appear to be some- 
what less clear than those of the other issues, in that there is some- 
what more overlap for factors obtained for this issue with 
noncorresponding factors in other issues. This may well be due to 
the relatively low salience, for our subjects, of this issue, causing 
less clear differentiation of responses to the items for this issue than 
for the other issues presented. 
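
The coefficient of congruence used above can be illustrated with a minimal sketch. It assumes the usual form of Tucker's coefficient, the sum of cross-products of two loading vectors divided by the square root of the product of their sums of squares. The loadings below are hypothetical values chosen only for illustration; they are not taken from Table 2.

```python
import math


def congruence(x, y):
    """Coefficient of congruence between two factor-loading vectors."""
    if len(x) != len(y):
        raise ValueError("loading vectors must have the same length")
    cross = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return cross / norm


# Hypothetical loadings of five items on the "same" factor for two issues.
issue_a = [0.72, 0.65, 0.10, -0.05, 0.58]
issue_b = [0.68, 0.70, 0.05, 0.02, 0.61]

phi = congruence(issue_a, issue_b)
print(f"coefficient of congruence = {phi:.3f}")
```

Unlike a Pearson r, the loadings are not centered before the cross-products are taken, so the coefficient is sensitive to the overall sign and size of the loadings as well as to their pattern.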

Inspection of the factor loadings in Table 2 indicates that again, 
as in Study I, the semantic differential intensity measure is most 
highly loaded on Factor I—the confidence factor. However, unlike 
the results of Study I, the affect measure also loaded highly on
Factor I for the “academic future” issue and on Factor II for the
“quarter system” issue.

In order to obtain some indication of the meaningfulness, or
construct validity, of the factors obtained here, the attitude issues
were compared in terms of mean factor scores. It was hypothesized that
the “academic future” issue would be most salient for these subjects,
i.e., most involving and of most concern, while the “farm subsidy”
issue would presumably be least significant to these subjects. Factor
scores on each of the four factors for each issue were computed for
each subject by summing the item scores for the four or five items
which most clearly defined each factor. (The items scored for each
factor are indicated in Table 2.) The factor score means and standard
deviations, along with the semantic differential affect and intensity
score means for each issue in Study II, are presented in Table 4.
Here it can be seen that the largest differences between the issues
occurred for Factors II (Motivational Involvement) and III
(Apathy), with the differences in the predicted directions. The
issues also differed on Factor I (Confidence) in the predicted
direction, although the mean differences for this factor were
relatively small. All differences between factor score means were
statistically significant beyond the .05 level, as were differences
between specific issues, as predicted. The issues also differed
significantly on Factor IV (Complexity), with the “academic future”
issue perceived as
significantly less controversial or complex than either the farm sub- 


TABLE 3 
Coefficients of Congruence 


Academic Future 
II IH 


Quarter System 
IV II Ir 
I 1.000 0.405  —0.410 —0.398 0.777 0.177 —0.286 
II 1.000 —0.286 —0.026 0.482 0.711 —0.243 
II 1.000 0.120 —0.231 —0.342 0.858 
IV 1.000 —0.358 —0.100 0.004 


I 1.000 0.447 —0.220 
1 
TABLE 3 (cont.)
E ——M———MMÀ— 
sdemic Academic Future Quarter System 
Future I II II IV I I IH IV 


0.20 0.555  —0.328 0.051 0.887 0.486 —0.220 0.099 
0.078 0.769  —0.300 0.189 0.408 0.862 —0.242 —0.109 
[ — —0.828 —0.591 0.900 —0.142 —0.357 —0.582 0.860 0.235 
—0.437 —0.000 0.053 0.675 —0.401 —0.069 —0.072 — 0.682 


uarter 
System 
0.686 0.598 —0.199 0.119 0.891 0.581 —0.026 —0.094 
0.260 0.804 —0.246 0.312 0.415 0.725 —0.376 0.037 
—0.133 —0.621 0.794 —0.211 —0.278 —0.582 0.741 0.032 
i —0.188 0.166 —0.014 0.921 —0.196 0.045 —0.179 0.777 
im 
Subsidy 
1.000 0.385 —0.148 0.054 0.802 0.241 —0.147 —0.046 
1.000 —0.456 0.339 0.645 0.866 —0.572 0.035 
1 1.000 —0.099 —0.217 —0.604 0.707 0.234 
1.000 0.102 0.173 —0.207 0.769 
mney 
| 1.000 0.511 —0.215 —0.026 
1.000 —0.439 —0.161 
1.000 0.061 
1.000 


Note: Coefficients for corresponding factors from different issues are underlined.


sidy or quarter system issues. Also consistent with the factor
analysis results, the three issues differed significantly, in the
predicted direction, on the semantic differential intensity measure
(academic future eliciting the highest intensity scores, the farm
subsidy the lowest). Thus, the stability of factor structures across
issues, as well as the confirmation of anticipated results concerning
differences between factor score means between issues, lends
considerable support to the meaningfulness and interpretability of
the attitude dimensions obtained here.

Closer examination of the attitude scores lends even more
credibility to the validity of the factors obtained here. On the basis
of the semantic differential data across the three issues, one would
conclude that the subjects felt highly positive toward their academic
future, whereas both the quarter system and the farm subsidy issues
evoked relative neutrality. However, if we examine the inten-
TABLE 4
Factor Score Means and Standard Deviations for the Attitude Issues


                                          Issue
                                Academic   Quarter    Farm
                                Future     System     Subsidy    F        p
Factors
  I   Confidence            X̄   17.62      16.72^b    16.48^b      9.17   <.00
                            S    2.76       1.97       2.98
  II  Motivational          X̄   18.87^a    14.42^b     9.74^c    232.22   <.001
      Involvement           S    3.26       4.51       3.92
  III Apathy                X̄    6.15^a     7.98^b    11.97^c    116.48   <.001
                            S    2.47       2.97       3.88
  IV  Complexity            X̄   15.36      18.30^b    18.27^b     43.09   <.001
                            S    3.66       3.32       3.12
Semantic Differential
      Affect                X̄   235225     22.31*     20.12.     111.00   <.001
                            S    5.95      11.17       6.62
      Intensity             X̄   13.20      10.97^b     5.98^c    100.04   <.001
                            S    Bom        5.23       4.92


Note: Means with different superscripts in a given row differ from one another at the .05 level
or better, according to the Newman-Keuls test.


sity data, as well as the variance of the affect scores, we can con- 
clude that, for the group, the response to the farm subsidy issue 
indicated true neutrality, i.e., lack of directional affect, whereas for 
the quarter system issue the apparent neutrality was an averaging 
artifact. In the latter case, relatively intense positive and negative 
responses were averaged out (cf. Fig. 1).

Considering the nonaffective factor score data, although all
factors statistically differentiated responses to the issues, more
thorough evaluation provides greater insight into the nature of these
factors. Figure 1 presents the factor score means on a scale relative
to the possible score range for each factor. In terms of scale
response potential, i.e., the response range permitted, the subjects
were only moderately certain about the three issues (Factor I).
Motivational involvement (Factor II) was more differentiating,
but none of the issues evoked particularly high motivational
involvement. More important was the absence of involvement,
particularly with regard to the farm subsidy issue. Apathy (Factor III)
did not strongly characterize response to any of the issues, but it
was lowest (as expected) for the academic future issue. Finally,
Figure 1. Attitude component mean scores presented in terms of scale response potential.
(Panels: Certainty, Motivational Involvement, Apathy, Controversiality-Complexity, Affect,
Intensity. Issues plotted: my academic future, the quarter system, farm subsidy program.)
subjects indicated that they felt that all of the issues were relatively
complex and controversial, being somewhat less apt to characterize
a highly personal issue on this dimension, i.e., academic future, than
the more abstract and social issues presented.


Discussion 


Taking an overview of the two studies, it readily becomes apparent
that the four nonevaluative dimensions obtained here have potential
utility. They are generalizable across issues, and they discriminate
among issues. However, it appears that the confidence factor may not
be particularly parsimonious, since, in general, it appears to be
highly related to attitude intensity as measured by folded semantic
differential scores, which can be derived simply.

In general, the factors obtained here appear to be relatively
independent of attitude affect as measured by the semantic differential.
Two exceptions were noted, however, viz., for Factor I (Confidence)
for the academic future issue, and Factor II (Motivational
Involvement) for the quarter system issue. Both of these factors had
substantial loadings on the semantic differential affect measure. In
the case of the academic future issue, the loading of affect on the
confidence factor is clearly an artifact due to the high correlation
between affect and intensity scores derived from the semantic
differential for this issue (r = .83). In this case the affect scores
were, understandably, practically all restricted to the positive end
of the scale, so that there was necessarily a high correlation between
affect and intensity.

The relatively high loading of affect on the motivational
involvement factor for the quarter system issue does appear to be a
substantively meaningful exception to the relative independence of
affect and the present attitude factors. For this issue, positive affect
would indicate support for a status quo situation, in which personal
involvement would then be expected to be minimal. On the other
hand, negative affect would indicate desire for change, in which
case motivational involvement would be expected to be greater. The
other issues studied here were either of minimal subjective
significance to our subjects (i.e., the farm subsidy issue) or did not
have the status quo character (i.e., Romney, academic future). Thus, in
both cases of apparent factor contamination with affect, the
discrepancies not only can be explained but also may have utility for
further research.

In terms of item content, Factors II and III would appear to be 
measuring opposite ends of the same continuum. However, not only 
did these factors emerge as independent across issues and subjects, 
but also they are only moderately negatively intercorrelated. 

The apparent paradox of Factors II and III being independent 
in spite of ostensible similarity in item content cannot be defini- 
tively explored within the scope of this study. Two interrelated 
clues, however, provide a basis for speculation and further research. 
The first relates to the fact that the motivational involvement items 
(Factor II) were stated positively and the apathy items (Factor 
III) negatively. The second clue relates to the fact (Table 4) that, 
regarding motivational involvement, subjects showed considerable 
variability in accepting or rejecting such a characterization of their 
response across the three issues probed. With regard to apathy,
however, even though the issues were significantly differentiated,
subjects on the whole did not describe their reactions to any of the
issues as being very apathetic.

Perhaps the most parsimonious explanation involves the effects of
positive or negative phrasing of the items within the two factors. It
may be that it is more socially acceptable to indicate that
involvement does not characterize one’s response to an issue than that
noninvolvement or apathy does. If this is so, then one would expect to
find less variability on the apathy dimension and more on the
involvement dimension, with a relatively low correlation between
them. This suggestion seems to be consistent with the data for
Factors II and III in Table 4.

An alternative explanation for the relative independence of
apathy and involvement items in this study concerns the particular
scaling method used to assess the factors here. It is quite possible
that if a bipolar scale of the semantic differential type were used to
assess these attitude components rather than the Likert-type scales
used here, the apathy and involvement items would coalesce into a
single dimension. Thus, if scales such as involved-noninvolved or
interested-uninterested were used, forcing the subject to make a
dichotomous decision, then these scales might be more highly
correlated than when separate items are used for each pole
(i.e., I am very interested in this, I am uninterested in this). Such a
speculation obviously requires further research.

Insofar as the dimensions obtained in the present study are mean- 
ingful and consistent across issues, it would seem that these attitude 
components may have potential utility in studies of attitude change 
and behavior prediction from attitudes. Specifically, it might be 
hypothesized that attitude issues for which subjects are more cer- 
tain and decisive, and for which subjects have greater motivational 
involvement, would be less susceptible to change than issues which 
involve less decisiveness and involvement. 

A recent doctoral dissertation (Graham, 1966) utilized a modified 
scale based upon Factor II (Motivational Involvement) as a variable
influencing the effectiveness of treatment in inducing attitudinal 
change. Significant differences in the effectiveness of treatment to 
induce change were attributable to the degree of motivational in- 
volvement, i.e., the higher the initial involvement with the issue, the 
less effective the treatment was in inducing change. 

Research is currently underway investigating the utility of these 
factors as moderators of behavior predictions via traditional affect 
attitude measures. It is hypothesized that behavior prediction 
would be more effective for those attitude issues which were more 
involving and for which subjects were more decisive. In fact, it 
would be of interest to compare the predictive efficiency of affect 
measures along with decisiveness and involvement measures as ad- 
ditional predictors of behavior. A regression equation, for example, 
might be developed to determine importance or contribution of these 
three attitude components as predictors of an appropriate behavior 
for the issue involved. 

The results obtained in this study suggest that, descriptively,
affect provides only partial insight into the response to attitude
issue stimuli, and can be supplemented meaningfully by utilization
of the nonaffective factor scales obtained here. Perhaps of more
significance is the contribution to research efforts and analysis of
attitude change and behavioral prediction possible by utilizing the
nonaffective scale data. It seems clear that future attitude studies


5 Data from an ongoing doctoral study provide evidence that motivational
involvement is also itself a significant predictor of behavior, in a
paradigm in which the subject has the option of engaging in a positive act,
a negative act, or in no activity regarding the issue.


BASS AND ROSEN 347 


will benefit greatly by a more refined conceptualization of attitude 
dimensions and components such as described in this research. 


Summary 


Forty items concerned with nonaffective dimensions of attitude
were administered to a large sample of college sophomores in the
setting of a political issue. A factor analysis of the items resulted in
four major factors which were independent of affect, viz., certainty,
motivational involvement, apathy, and complexity-controversiality.
Selecting those items from the original 40 that were both factorially
pure and most heavily loaded on their respective factors, a
new 18-item scale was developed. This scale and a six-polarity
semantic differential were then administered to another large sample
of college sophomores, probing three issues that on an a priori basis
would appear to differ in terms of motivational involvement.
Further analyses indicated that the original factors were quite
consistent across the three issues. Moreover, the three issues, viz.,
“my academic future,” “the quarter system,” and the “farm subsidy
program,” were compared in terms of scores derived from the items
loaded on each factor. Results were in the expected direction and
significant. Given the additional information on the nonaffective
dimensions, far more insight into the nature of the subjects’ response
to the issues was possible than that derived from the pure affect
measure.

Aside from demonstrating some validity for the hypothesized at- 
titude dimensions suggested in the literature, the findings provide a 
basis for better understanding (and more refined research strate- 
gies) into the dynamics of both attitude change and behavioral pre- 
diction on the basis of attitude data. 
There are many instances where individual and group forms of a
test may be correlated. The group form score is the number of
items completed in a specified time period, while the individual
form score may be the time taken to complete a fixed number of
items. At first glance these two scoring techniques appear identical.
However, even if two tests were to vary identically it would be
impossible to obtain a 1.00 correlation because of the nature of the
scoring units.


Analysis 
Individually administered forms involve the presentation of a
constant number of items ($N_I$); the time ($T_I$) taken to complete
$N_I$ is variable. For the group form, a constant time ($T_G$) is used
and the number of items completed ($N_G$) is variable. Thus,

$$\frac{N_I}{T_I} = \frac{N_G}{T_G} \qquad (1)$$

In our particular comparisons $N_I$ and $T_G$ are constants and
therefore a hyperbolic function results:

$$\frac{N_I T_G}{T_I} = N_G, \quad \text{or} \quad \frac{\text{constant}}{T_I} = N_G \qquad (2)$$

If a Pearson r is computed to give a straight line fit, a negative
correlation which can never equal −1.00 is the result.
In a specific instance (Jackson, Messick, and Myers, 1964) a
group form (G) of Witkin’s Embedded Figures Test (EFT) (Witkin,
Lewis, Hertzman, Machover, Meissner, and Wapner, 1954) was
compared to an individually administered form (I). By dividing I
in half, and making from I an I, 1st and 2nd half, and a G, 1st and
2nd half, it was possible to use different parts of the same test for
the I-G comparison. For I, $N_I$ = 12 (items) and for G, $T_G$ = 10
(min.) in this case. By substitution,

$$T_I = \frac{12 \times 10}{N_G} = \frac{120}{N_G} \qquad (3)$$

which is a hyperbolic function. (Formula [3] can be used to
convert scores to a linear function in this case.)

From the mean and standard deviation of $N_G$ found by Jackson
et al. (1964) it is possible to estimate the portion of the hyperbolic
function which their data covered. Substituting eight data points
in equation (2) (between ±2σ around the mean) yielded a Pearson
r of −.92 and a linear regression equation of Y = −.6X + 20. The
data reported by Jackson et al. (1964) yielded an equation of Y =
−10X + 18.

To determine the upper limit for the Pearson r (y = x being the
perfect case), c/x was substituted for y. The constant drops out,
meaning that the upper limit is dependent on the range of scores,
not the curvature of the function. The Pearson r reduced to:

$$r = \frac{\operatorname{cov}(x,\, 1/x)}{\sigma_x \, \sigma_{1/x}} \qquad (4)$$

Integrating the equation and expressing it in terms of the range of
scores, a and b, where b is the lower limit, a is the upper limit, and
where z = a/b, yields:

$$r = \frac{2\sqrt{3}\left[1 - \frac{1}{2}\,\frac{a+b}{a-b}\,\ln\frac{a}{b}\right]}{\left[\frac{(a-b)^2}{ab} - \left(\ln\frac{a}{b}\right)^2\right]^{1/2}} \qquad (5)$$

$$r = \frac{\sqrt{3}\left[2 - \frac{z+1}{z-1}\,\ln z\right]}{\left[\frac{(z-1)^2}{z} - (\ln z)^2\right]^{1/2}} \qquad (6)$$

In the previous example the eight data points within ±2σ around
the mean yielded a correlation of −.92. The formulas above, where
a/b = 15/3 = 5 = z, yielded −.93. Figure 1 shows the upper limit
for differing values of z.
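
The upper limit can be checked numerically with the sketch below. It assumes that scores are spread uniformly over the interval from b to a, which is the assumption under which the closed form in z = a/b follows from integration. The function names are illustrative, and the constant 120 is simply the product of the 12 items and 10 minutes mentioned in the text; the correlation does not depend on it.

```python
import math


def pearson_upper_limit(z):
    """Pearson r between x and c/x when x is uniform on [b, a], with z = a/b > 1."""
    if z <= 1:
        raise ValueError("z = a/b must be greater than 1")
    log_z = math.log(z)
    numerator = math.sqrt(3.0) * (2.0 - (z + 1.0) / (z - 1.0) * log_z)
    denominator = math.sqrt((z - 1.0) ** 2 / z - log_z ** 2)
    return numerator / denominator


def pearson_r(xs, ys):
    """Ordinary Pearson product-moment correlation."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Cross-check on a dense, evenly spaced grid of scores between b = 3 and a = 15.
b, a, c = 3.0, 15.0, 120.0
steps = 10_000
grid = [b + (a - b) * (i + 0.5) / steps for i in range(steps)]
r_grid = pearson_r(grid, [c / x for x in grid])
r_formula = pearson_upper_limit(a / b)

print(f"closed form: {r_formula:.4f}   dense grid: {r_grid:.4f}")
for z in (2, 5, 10):
    print(f"z = {z:>2}: r = {pearson_upper_limit(z):.3f}")
```

Under this uniform-score assumption the value for z = 5 comes out at roughly −.92, in line with the −.92 to −.93 quoted above, and the limit moves further from −1.00 as the range ratio z widens.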
Jackson et al. (1964) subtracted a constant from each time score
so that correlations would be positive. In the perfect case this would
be y = C − c/x. Substituting into the Pearson r formula (again
the constants drop out) yielded:

$$r = -\frac{\operatorname{cov}(x,\, 1/x)}{\sigma_x \, \sigma_{1/x}}$$

which is equal to equation (4) except the sign is changed. Since
correlations are of the same magnitude at a given range of scores,
formulas (5) and (6) also apply to this case.


Discussion 

It can be easily seen that the use of comparable scoring units is 
important. Jackson et al. (1964) found correlations of .56-.84. By 
using .93 as the approximate upper limit in this case, these corre- 
lations should increase to .61—.91 if reciprocal data transformations 
of the group scores were used. This procedure would decrease the 
error variance by 6-14 per cent. 

Certain implications for future research are immediately evident: 


1. It should be determined if time (T/N) or rate (N/T) scores
are being used.

2. Alternate forms with these scoring unit differences necessitate
reciprocal data transformations.

3. A criterion found to correlate (linearly) with time scores
necessitates reciprocal transformations of predictor rate scores
before correlation analysis is made.

4. Conversely, a criterion found to correlate (linearly) with rate
scores necessitates reciprocal transformations of predictor time
scores.


An example of how point 3 above should have been applied can be
given. Recently (Barrett, Cabe, and Thornton, 1968), a group form
Hidden Figures Test (HFT) supplied by Educational Testing
Service was compared to Rod and Frame Test (RFT) scores. Linear
correlations were low (−.06 to −.47). The first conclusion was
that the HFT was markedly different from Witkin's original
individually administered EFT, since he found EFT-RFT correlations
of .43-.76. Barrett et al. (1968), however, performed several
data transformations and found higher correlations. One of these
happened to be the correct y = 1/x relationship. Correlations
ranged from .36 to .65, quite consistent with Witkin's EFT
findings.

Perhaps it would be profitable for investigators to reanalyze their 
data before they conclude they have nonsignificant results. Espe- 
cially, since group forms are generally utilized in large scale re- 
search efforts, many important findings may await only simple 
reciprocal data transformations. 
A COMPARISON OF BISERIAL DISCRIMINATION,
POINT BISERIAL DISCRIMINATION, AND
DIFFICULTY INDICES IN ITEM ANALYSIS DATA

LAWRENCE M. ALEAMONI AND RICHARD E. SPENCER
University of Illinois


Correlation has been established as a useful device when one is
interested in determining whether or not a relationship exists
between two measures or variables. The formulation correlation takes
is influenced by the format of the data as well as by the assumptions
the particular user is willing to make. In item analysis, for example,
the user is usually faced with dichotomous and continuous data
for which he would like to determine degrees of association.
Although there are many formulations of correlation applicable to
item-type data, the biserial and point biserial coefficients are the
most popular.

When analyzing a particular examination via item analysis data,
discrimination and difficulty indices are among the most important
statistics required. The biserial and point biserial correlations
are usually defined as discrimination indices and the proportion
of subjects passing each item as the difficulty index. There has been
much discussion about the assumptions underlying them as
well as the relationship of each to item difficulty. Comments by
authors such as Gulliksen (1950), Carroll (1945), Adams (1960), and
Guilford (1954) indicate that biserial correlation is essentially
unrelated to item difficulty, while point biserial correlation is much
more highly related. Although the criterion score distribution will
have an effect on the discrimination indices, the above authors felt
that point biserial r would, consistently, yield the highest degree
of relationship to difficulty level, especially for items of moderate
difficulty. This argument implies that the relationship between the
biserial and point biserial correlation coefficients should not be very
high. It further implies that each of the discrimination coefficients
should be quite differently related to a common difficulty level.
However, Engelhart (1965) found the correlation between the
biserial and point biserial coefficients (estimated by an abac which
required only the upper and lower portions of the score distribution)
to be very high, approximately .98.

The purpose of the present study, therefore, was to investigate, 
empirically, the relationship of biserial and point biserial discrim- 
ination indices to the difficulty index as well as to each other. 


Method 


The Modern Language Association-Foreign Language Tests in 
reading and listening comprehension were administered to all en- 
rollees (4,300) attending the first four semester courses at the Uni- 
versity of Illinois in French, German, Russian, and Spanish for the 
1965-66 school year. Each language test had two levels for both 
reading and listening. The lower level test form LB was admin- 
istered to the 101 and 102 level courses. The upper level form MB 
was administered to the 103 and 104 level courses. 

An item analysis program was written for the IBM 7094 com- 
puter which provided the following item statistics: 

1. Difficulty index = the proportion of subjects passing each item
= p.

2. Biserial r discrimination index = $\frac{M_p - M_t}{\sigma_t} \cdot \frac{p}{y} = r_b$

3. Point biserial r discrimination index = $\frac{M_p - M_t}{\sigma_t} \cdot \sqrt{p/q} = r_{pb}$

where $M_p$ = mean score of those subjects passing each item.

$M_t$ = mean score of all subjects on all items.

$\sigma_t$ = standard deviation of the scores of all subjects.

q = proportion of subjects failing each item.

y = height of the unit normal curve at the point of the dichotomy.
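
The three item statistics can be sketched in a few lines. This is not the IBM 7094 program described above; it is a minimal illustration that assumes the population form of the standard deviation and uses hypothetical responses for a single item. The helper `pearson` is included only to confirm that the point biserial formula agrees with an ordinary product-moment correlation.

```python
import math
from statistics import NormalDist


def pearson(xs, ys):
    """Ordinary Pearson product-moment correlation."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def item_statistics(item, total):
    """Return (p, r_b, r_pb) for one item scored 0/1 against total scores."""
    n = len(item)
    n_pass = sum(item)
    if n_pass == 0 or n_pass == n:
        raise ValueError("item must be passed by some but not all subjects")
    p = n_pass / n                      # difficulty index
    q = 1.0 - p
    m_t = sum(total) / n                # mean of all subjects
    m_p = sum(t for i, t in zip(item, total) if i == 1) / n_pass
    sigma_t = math.sqrt(sum((t - m_t) ** 2 for t in total) / n)
    unit_normal = NormalDist()
    y = unit_normal.pdf(unit_normal.inv_cdf(p))  # ordinate at the dichotomy
    standardized_gap = (m_p - m_t) / sigma_t
    r_b = standardized_gap * p / y
    r_pb = standardized_gap * math.sqrt(p / q)
    return p, r_b, r_pb


# Hypothetical data: ten subjects, one item, and their total test scores.
item = [1, 1, 0, 1, 1, 0, 1, 0, 1, 0]
total = [42, 38, 35, 20, 40, 25, 33, 18, 22, 30]

p, r_b, r_pb = item_statistics(item, total)
print(f"p = {p:.2f}, biserial = {r_b:.3f}, point biserial = {r_pb:.3f}")
```

For the same item the biserial value is always the larger of the two in absolute size, because p/y exceeds the square root of p/q at every difficulty level.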


Correlations among the three item statistics were computed in
order to examine their relationships on data derived under normal
conditions with standardized examinations.


Results 
The correlations between the biserial and point biserial item
discrimination indices are presented in Table 1 for each set of
examinations. The values range from a low of .92 to a high of .99. The
correlations for the Russian group were the lowest, but it also had
the least number of students. A mean correlation of .97 was
obtained.

The correlations between the difficulty index and each of the
discrimination indices are presented in Table 2 for each set of
examinations. Inspection of this table indicates that the relationships
of the biserial and point biserial to the difficulty index are quite
consistent and similar in the magnitude, sign, and significance level
of their correlation coefficients.


Summary and Conclusions 


For the set of examinations used in this study it appears that the 
choice of a biserial or point biserial item discrimination index would 
yield the same rank ordering of items. This finding supports simi- 


TABLE 1 


Correlations between Biserial and Point Biserial Item Discrimination
Indices on a Set of 16 Standardized Examinations


Lar Number Number 
LB 50 .96 719 27.98 8.31 
ou 7.76 
Trench 50 .96 863 25.61 F 
; LB 45 98 720 21.08 6.29 
Listening 
LB 50 98 873 27.08 8.06 
Reading V 
German MB 50 97 487 27.88 
: LB 45 98 855 21.33 8.14 
Listening 
= MB 40 .98 487 18.80 7.49 
LB 50 .97 17 21.43 10.58 
Reading 
Russian MB 50 —.99 s9 21.42 10.04 
; LB 45 .99 147 19.32 7.12 
Listening 
LB 50 .98 643 24.00 8.39 
Reading Vi 
Spanish MB s OBR ae Mises S DS 
LB 45 — 99 657 19.64 6.51 
TABLE 2


Correlations between the Difficulty Index and the Biserial and Point Biserial
Discrimination Indices on a Set of 16 Standardized Examinations


Number 
Language Test Form of Items Te Tob 
O Oena O o 76 l 
d LB 50 2x01 =.19 
Reading | 
MB 50 —.08 -u 
French 
LB 45 30* 28 
Listening 
MB 40 .05 AT 
LI EMEND 60 05 UN 
h LB 50 —.03 -48 
Reading 
MB 50 09 -.08 
German 
45 —.26 —.25 
Listening 
MB 40 .14 20 
OSRAM MRNMINIMBION 2, 40.5. 5 14 108 
LB 50 —.16 .06 
Reading 
MB 50 —.02 06 
Russian 
x1 LB 45 .05 att 
istening 
MB 40 .31* A 
ER ee eC EES UM 10 oo.mpm o Hes 
LB 50 .00 05 
Reading T 
MB 50 46** S 
Spanish 
LB 45 08 09 
Listening 5 
MB 40 my 2 
Moy, aa i oL 


* Significant at .05 level.
** Significant at .01 level. 


lar results reported by Guilford (1954) and Engelhart (1965). The
high linear relationship between the biserial and point biserial
indices indicates that either coefficient could be used, equally well,
as an item discrimination index, unless one is interested in the level
of significance of the index.

Actually, the lack of a perfect linear relationship between the 
biserial and point biserial indices is a result of their perfect curvilin- 
ear relationship formulated by 


$$r_b = r_{pb} \, \frac{\sqrt{pq}}{y}$$
Figure 1 indicates how this curvilinear relationship would look if
$(M_p - M_t)/\sigma_t$ was fixed, thereby setting $r_{pb} = \sqrt{p/q}$ and $r_b = p/y$.
The values for $\sqrt{p/q}$ and $p/y$ would, therefore, range from .10 to
9.95 and .37 to 37.08, respectively.
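
The end points quoted above can be approximated with a short sketch. It assumes that p runs from .01 to .99 in steps of .01, which is inferred from the stated range of .10 to 9.95 and is not given explicitly in the text; the function name is illustrative.

```python
import math
from statistics import NormalDist


def fixed_gap_indices(p):
    """Return (sqrt(p/q), p/y): the two indices when the standardized gap is 1."""
    q = 1.0 - p
    unit_normal = NormalDist()
    y = unit_normal.pdf(unit_normal.inv_cdf(p))  # ordinate at the dichotomy
    return math.sqrt(p / q), p / y


# Assumed grid of difficulty values, p = .01, .02, ..., .99.
curve = [fixed_gap_indices(k / 100) for k in range(1, 100)]

low_pb, low_b = curve[0]
high_pb, high_b = curve[-1]
print(f"p = .01: sqrt(p/q) = {low_pb:.2f}, p/y = {low_b:.2f}")
print(f"p = .99: sqrt(p/q) = {high_pb:.2f}, p/y = {high_b:.2f}")
```

Both quantities rise together as p increases, which is why a straight line fits the resulting points closely even though the underlying relationship is curved.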

The set of points to which the curve was fitted in Figure 1, by the
least squares method, can also be fitted by a straight line the slope
of which is very nearly 1.00. For this reason, the linear relationship
between the $r_b$ and $r_{pb}$ as portrayed in Figure 1 would be very high.

The linear relationship of each of the discrimination indices to the 
difficulty index yielded such consistent results that there was no 
support found for the hypothesis that the point biserial is more 
highly related to item difficulty than the biserial. 

It appears, therefore, that the high degree of relationship between 
the biserial and point biserial discrimination indices as well as 
their highly consistent relationship to the difficulty index, argues 
against selecting one or the other for item analysis, (except when 


Figure 1. Graph of $r_b$ versus $r_{pb}$ when $(M_p - M_t)/\sigma_t$ is fixed
($r_{pb}$ axis from .10 to 9.95; $r_b$ axis from .37 to 37.08).
the level of significance is desired) especially if the difficulty level of
the items is a consideration.
RELIABILITY OF MULTIPLE-CHOICE TEST SCORES
IS NOT THE PROPORTION OF VARIANCE
WHICH IS TRUE VARIANCE


ROBERT B. FRARY¹
The University of Miami


The basic problem of reliability is to develop a measure to reflect
the degree of likelihood that scores will be similar on two
administrations of a test. Such a measure is the correlation between
the resulting scores, which under certain definitions and assumptions
can be shown to be equal to the proportion of variance which is “true”
variance. One such line of reasoning defines an individual’s true
score as the limit approached by the average of his scores on an
endless series of parallel tests (assuming such a limit exists). If the
tests are multiple-choice with no correction for guessing, an individual
who knows nothing in the area tested will have a positive “true”
score. In another derivation, “true” score is defined as the difference
between total score and “error” score, where “error” scores are
assumed to have a mean of zero. Here it is not immediately clear
whether the part of an individual's score due to successful guessing
(randomly or otherwise) is included in “true” score, the positive
component of “error” score, or partly in both.

It is the purpose of this paper to show that in general, if true
score on a multiple-choice test does not include the portion of score
due to successful guessing, the correlation between parallel forms is
not equivalent to the proportion of variance which is true variance.
Further, it will be shown that under these conditions, defining
reliability as the ratio of true variance to total variance leads to
inconsistencies and absurdities.


¹ The author wishes to express his thanks to Professor John R. Hills for
helpful suggestions concerning the preparation of this paper.
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Variance of Multiple-Choice Test Scores 


On a multiple choice test, the raw score (X) of an examinee may 
be expressed as the sum of the number of items known or true score 
(T), the number of items guessed correctly (G), and an error term 
(E). 


X=T+G+4+E8 (18) 
In deviation scores the same relationship is 


Z=itgt+e. (1b) 


Here E represents random error such as that caused by clerical
errors or transitory environmental conditions. It is assumed that
$\bar{E} = 0$, so that $e = E$. Under these conditions, the variance of
a set of scores may be expressed as


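$$\sigma_x^2 = \sigma_t^2 + \sigma_g^2 + \sigma_e^2 + 2(r_{tg}\sigma_t\sigma_g + r_{te}\sigma_t\sigma_e + r_{ge}\sigma_g\sigma_e). \tag{2}$$


Since error scores are assumed to be random, $r_{te}$ and $r_{ge}$ are zero.
Then (2) may be written as


$$\sigma_x^2 = \sigma_t^2 + \sigma_g^2 + \sigma_e^2 + 2 r_{tg}\sigma_t\sigma_g. \tag{3}$$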
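
As a numerical check on the decomposition in (2) and (3), the following Python sketch builds raw scores from true, guessing, and error components and compares the variance of X with the sum of the component variances plus twice the covariances. The component scores are illustrative values assumed here, not data from the paper, and population (divide-by-N) variances are used throughout.

```python
from itertools import combinations


def mean(values):
    return sum(values) / len(values)


def cov(a, b):
    """Population covariance (divide by N)."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


def decomposed_variance(components):
    """Component variances plus twice every pairwise covariance, as in (2)."""
    total = sum(cov(c, c) for c in components)
    total += 2 * sum(cov(a, b) for a, b in combinations(components, 2))
    return total


# Illustrative component scores (assumed values, not data from the paper).
T = [3, 6, 9, 12]   # items known
G = [3, 2, 1, 0]    # items guessed correctly
E = [1, -2, 1, 0]   # error scores with mean zero
X = [t + g + e for t, g, e in zip(T, G, E)]

var_x = cov(X, X)                          # variance of the raw scores
var_sum = decomposed_variance([T, G, E])   # right-hand side of (2)
print(var_x, var_sum)
```

In this assumed example the error scores are uncorrelated with both other components, so the two error covariances vanish and (2) reduces to (3).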


In some cases $r_{tg}$ may be approximately zero. Swineford (1941)
and others have found that propensity to guess and hence to
increase guessing score is, under some circumstances, unrelated to
ability as measured by the test given. However, there are plausible
arguments that in many cases $r_{tg} \neq 0$. If the examinees know there
is no penalty for guessing and hence mark every item, $r_{tg}$ may be
negative due to higher guessing scores for those who know least. On
the other hand, $r_{tg}$ can be positive. If those who know more are
more adept at utilizing partial information or clues inadvertently
furnished by the item writer, they may have higher guessing scores
than those who guess randomly or who in their ignorance may be
intimidated more easily by admonitions not to guess.


$r_{tg}$ Positive
Suppose that for a set of scores $r_{tg}$ is positive. To apply the
proportion of variance concept of reliability, it must be decided
whether guessing variance is "nontrue." It may be the case that
$r_{tg} > 0$ is the result of judicious use of partial information by
better prepared examinees. Thus the following expression might be
suggested for reliability under the proportion of variance concept.


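$$r_{xx} = \frac{\sigma_t^2 + \sigma_g^2 + 2 r_{tg}\sigma_t\sigma_g}{\sigma_x^2} \tag{4a}$$
On the other hand, $r_{tg} > 0$ may result from some extraneous factor
such as general intelligence. Then the expression suggested under
the proportion of variance concept is

$$r_{xx} = \frac{\sigma_t^2}{\sigma_x^2}. \tag{4b}$$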


Neither (4a) nor (4b) is equivalent to a correlation coefficient be- 
tween parallel forms. To show this nonequivalence an expression 
will be derived for the correlation between scores on parallel forms. 


Correlation between Scores on Parallel Forms 


For deviation scores $x = t + g + e$, the correlation between parallel
forms a and b is by definition


$$r_{x_a x_b} = \frac{\sum x_a x_b}{N\sigma_{x_a}\sigma_{x_b}} = \frac{\sum t_a t_b + \sum t_a g_b + \sum t_b g_a + \sum g_a g_b + \sum (e_a t_b + e_a g_b + e_b t_a + e_b g_a + e_a e_b)}{N\sigma_{x_a}\sigma_{x_b}}.$$

Here parallel forms refers to forms such that each examinee has
the same true score on both, total variance is the same for both, and
error scores on one form are uncorrelated with true scores on either
form or with error scores on the other (Gulliksen, 1950). Error
scores will also be assumed to be uncorrelated with guessing scores.


Under these assumptions the last term in the numerator above is
0, $\sigma_{t_a}^2 = \sigma_{t_b}^2 = \sigma_t^2$, and $\sigma_{x_a}^2 = \sigma_{x_b}^2 = \sigma_x^2$, so that


$$r_{x_a x_b} = \frac{\sigma_t^2 + r_{t g_b}\sigma_t\sigma_{g_b} + r_{t g_a}\sigma_t\sigma_{g_a} + r_{g_a g_b}\sigma_{g_a}\sigma_{g_b}}{\sigma_x^2}.$$

Further, it is reasonable to assume that $r_{t g_a} = r_{t g_b} = r_{tg}$ and
$\sigma_{g_a}^2 = \sigma_{g_b}^2 = \sigma_g^2$. Then


$$r_{x_a x_b} = \frac{\sigma_t^2 + 2 r_{tg}\sigma_t\sigma_g + r_{g_a g_b}\sigma_g^2}{\sigma_x^2}. \tag{5}$$

It may be noted at this point that (5) is equivalent to neither
(4a) nor (4b). In fact the question of whether guessing variance is
"nontrue" is not really relevant to the determination of the
correlation between parallel forms.


$r_{tg}$ of Zero


Even if $r_{tg} = 0$, the parallel forms and proportion of variance
definitions are not necessarily equivalent. Personality characteristics
may make some examinees’ guessing scores consistently higher than
others. Torrance and Ziller (1957) note this phenomenon and report
the persistence of guessing by some individuals even when penalties
for guessing are severe. Under such circumstances $r_{g_a g_b} > 0$ may be
expected even if $r_{tg} = 0$. Then the correlation between parallel forms
(5) will be greater than the proportion of variance which is true
variance.

Only if $r_{tg} = 0$ and $r_{g_a g_b} = 0$ will (5) be equivalent to the
proportion of variance which is true variance.
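
The comparison can be illustrated numerically. The sketch below evaluates formula (5), read here as having numerator $\sigma_t^2 + 2 r_{tg}\sigma_t\sigma_g + r_{g_a g_b}\sigma_g^2$ over the total variance of equation (3), alongside the proportion of variance which is true variance. The function names and all variance values are hypothetical, chosen only for illustration.

```python
def total_variance(var_t, var_g, var_e, r_tg):
    """Total-score variance from equation (3)."""
    return var_t + var_g + var_e + 2 * r_tg * (var_t * var_g) ** 0.5


def parallel_forms_r(var_t, var_g, var_e, r_tg, r_gagb):
    """Correlation between parallel forms, following formula (5)."""
    cross = 2 * r_tg * (var_t * var_g) ** 0.5
    return (var_t + cross + r_gagb * var_g) / total_variance(var_t, var_g, var_e, r_tg)


def true_variance_ratio(var_t, var_g, var_e, r_tg):
    """Proportion of total variance that is true variance."""
    return var_t / total_variance(var_t, var_g, var_e, r_tg)


# Hypothetical variances: true 16, guessing 4, error 5, with r_tg = 0.
consistent_guessers = parallel_forms_r(16, 4, 5, r_tg=0.0, r_gagb=0.5)
no_consistency = parallel_forms_r(16, 4, 5, r_tg=0.0, r_gagb=0.0)
ratio = true_variance_ratio(16, 4, 5, r_tg=0.0)
print(consistent_guessers, no_consistency, ratio)
```

With these assumed values the parallel-forms correlation exceeds the true-variance ratio when guessing scores are consistent across forms, and the two coincide only when both correlations are zero.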


$r_{tg}$ Negative

The nonapplicability of the proportion of variance definition is
particularly evident when $r_{tg} < 0$. For example, it may then be the
case that $\sigma_g^2 + 2 r_{tg}\sigma_t\sigma_g + \sigma_e^2 < 0$. As a result (see equation 3)
$\sigma_t^2 > \sigma_x^2$ so that $\sigma_t^2/\sigma_x^2 > 1$. Table 1 gives hypothetical true,
guessing, error, and total scores for four examinees on a three-choice,
twelve-item test to illustrate that this situation can arise under not
too far-fetched circumstances.

TABLE 1
Hypothetical Scores for Four Examinees on a Three-Choice, Twelve-Item Test

True Score    Guessing Score    Error Score    Total Score
     3               3                1               7
     6               2               -2               6
     9               1                1              11
[fourth row illegible]

Even if $r_{tg} < 0$ and $2 r_{tg}\sigma_t\sigma_g + \sigma_g^2 + \sigma_e^2 > 0$, it is still possible
and even likely that $2 r_{tg}\sigma_t\sigma_g + \sigma_g^2 < 0$. In this case, with $r_{tg} < 0$,
surely guessing variance is “nontrue” and better eliminated. Upon
elimination of guessing variance, reliability should rise. However,
for $2 r_{tg}\sigma_t\sigma_g + \sigma_g^2 < 0$, elimination of guessing variance will reduce
reliability under the proportion of variance definition. Originally
the proportion of variance which is true variance is

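$$r_{xx} = \frac{\sigma_t^2}{\sigma_t^2 + \sigma_g^2 + 2 r_{tg}\sigma_t\sigma_g + \sigma_e^2}. \tag{6}$$
After elimination of guessing variance

$$r'_{xx} = \frac{\sigma_t^2}{\sigma_t^2 + \sigma_e^2}. \tag{7}$$
If $2 r_{tg}\sigma_t\sigma_g + \sigma_g^2 < 0$, the denominator of (7) is larger than that of
(6) and $r'_{xx} < r_{xx}$.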
It is not only possible but likely that $2 r_{tg}\sigma_t\sigma_g + \sigma_g^2 < 0$ because
this inequality is equivalent to


$$r_{tg} < -\tfrac{1}{2}(\sigma_g/\sigma_t). \tag{8}$$
If $\sigma_g \le \sigma_t/2$, as would be the case in most practical testing
situations, $r_{tg}$ need be no smaller than $-1/4$ to satisfy (8) and hence to
have $2 r_{tg}\sigma_t\sigma_g + \sigma_g^2 < 0$. If each examinee guesses randomly on all
items not known, $r_{tg} < -1/4$ is quite likely.
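
A short Python sketch may make the argument concrete. It tests inequality (8) and compares the proportion-of-variance reliability before and after guessing variance is removed, taking (6) as true variance over total variance and (7) as true variance over true plus error variance. The function names and the standard deviations used are assumptions made for illustration only.

```python
def satisfies_inequality_8(r_tg, sd_t, sd_g):
    """True when r_tg < -(1/2)(sd_g / sd_t), i.e. 2*r_tg*sd_t*sd_g + sd_g**2 < 0."""
    return r_tg < -0.5 * (sd_g / sd_t)


def reliability_with_guessing(sd_t, sd_g, sd_e, r_tg):
    """Proportion of true variance with guessing present, as in (6)."""
    return sd_t ** 2 / (sd_t ** 2 + sd_g ** 2 + 2 * r_tg * sd_t * sd_g + sd_e ** 2)


def reliability_without_guessing(sd_t, sd_e):
    """Proportion of true variance after guessing variance is removed, as in (7)."""
    return sd_t ** 2 / (sd_t ** 2 + sd_e ** 2)


# Hypothetical values: sd_g is half of sd_t, so the threshold in (8) is -1/4.
sd_t, sd_g, sd_e = 4.0, 2.0, 2.0
before = reliability_with_guessing(sd_t, sd_g, sd_e, r_tg=-0.5)
after = reliability_without_guessing(sd_t, sd_e)
print(satisfies_inequality_8(-0.5, sd_t, sd_g), before, after)
```

Under these assumed values the inequality holds and the ratio falls once guessing variance is removed, which is the paradox the text describes.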


Correlation between True and Total Scores 


If the scores have no guessing component, $r_{tx}^2 = r_{xx}$, because in
this case $r_{xx} = \sigma_t^2/\sigma_x^2$, and the square of the correlation coefficient
is equal to the proportion of total-score variance accounted for by
true scores. Thus it is of interest to derive an expression for
$r_{tx}$ in the case in which the guessing component of scores is not
zero.

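By definition

$$r_{tx} = \frac{\sum tx}{N\sigma_t\sigma_x} = \frac{\sum t^2}{N\sigma_t\sigma_x} + \frac{\sum tg}{N\sigma_t\sigma_x} + \frac{\sum te}{N\sigma_t\sigma_x} = \frac{\sigma_t^2 + r_{tg}\sigma_t\sigma_g + r_{te}\sigma_t\sigma_e}{\sigma_t\sigma_x}. \tag{9}$$

Under the usual assumptions regarding error scores, $r_{te} = 0$. Thus
(9) reduces to

$$r_{tx} = \frac{\sigma_t^2 + r_{tg}\sigma_t\sigma_g}{\sigma_t\sigma_x} = \frac{\sigma_t + r_{tg}\sigma_g}{\sigma_x}, \tag{10}$$

so that

$$r_{tx}^2 = \frac{\sigma_t^2 + 2 r_{tg}\sigma_t\sigma_g + r_{tg}^2\sigma_g^2}{\sigma_x^2}. \tag{11}$$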


From (11) it is seen that $r_{tx}^2$ is equivalent to neither the
correlation between parallel forms (formula 5) nor one of the
"proportion of variance" expressions for reliability (formulas 4a and 4b).
Thus, in view of the earlier comparison of formula 5 with formulas 
4a and 4b, no two of the three approaches to defining reliability are 
equivalent in the case of a multiple-choice test on which the exam- 
inees guess. 
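
The expression for the correlation between true and total scores can be checked against a direct computation. The sketch below, using illustrative component scores assumed here (not data from the paper) in which error is uncorrelated with true score, compares a directly computed Pearson correlation with the value given by formula (10), read as $(\sigma_t + r_{tg}\sigma_g)/\sigma_x$.

```python
def mean(values):
    return sum(values) / len(values)


def pstdev(values):
    """Population standard deviation (divide by N)."""
    m = mean(values)
    return (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5


def pearson(a, b):
    """Product-moment correlation between two equal-length score lists."""
    ma, mb = mean(a), mean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5


def r_tx_from_formula(sd_t, sd_g, sd_x, r_tg):
    """Formula (10); valid only when true and error scores are uncorrelated."""
    return (sd_t + r_tg * sd_g) / sd_x


# Illustrative component scores (assumed values).
T = [3, 6, 9, 12]
G = [3, 2, 1, 0]
E = [1, -2, 1, 0]
X = [t + g + e for t, g, e in zip(T, G, E)]

direct = pearson(T, X)
via_formula = r_tx_from_formula(pstdev(T), pstdev(G), pstdev(X), pearson(T, G))
print(direct, via_formula, direct ** 2)
```

The two routes agree for these data, while the squared correlation differs from the ratio of true to total variance, which here exceeds one.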


Implications for Test Theory 


Test theory is most often applied in cases where the scores are 
from multiple-choice tests. Instructions to the examinees regarding
guessing and examinee characteristics affecting guessing behavior 
vary from one situation to another. However, the statistical treat- 
ment of the scores in many instances does not take these differences 
into consideration. For example, simple comparisons between 
scores of well prepared and poorly prepared groups are misleading 
if $r_{tg}$ is positive for the first group and negative for the second.
Also, validity coefficients are affected by the guessing component of 
scores. If it is desired only to estimate predictive efficiency of scores, 
consideration of raw scores alone may be sufficient. However, if the 
investigator is seeking to determine true score relationships the ef- 
fect of guessing must be taken into consideration. 

With regard to various test parameters, the effects of eliminating 
the guessing component of scores, by whatever means, have not 
been fully investigated. For example, the effect of such elimination 
on reliability and validity would depend on relationships between 
the guessing scores and other score components. However, these 
relationships have not been fully specified. 

The earlier sections of this paper give only a small sample of 
formulas which differ from commonly stated ones when the guessing 
component of scores is considered separately. Since it is difficult to 
justify including the result of a correct guess in either the true or
error component of scores and in view of the comments just made,
a considerable amount of theoretical development is suggested.


A FACTORIAL STUDY OF THE ILLINOIS TEST
OF PSYCHOLINGUISTIC ABILITIES WITH 
CHILDREN HAVING ABOVE AVERAGE 
INTELLIGENCE 


MILTON WISLAND AND WESLEY A. MANY
Northern Illinois University 


Kirk and McCarthy (1961a) recently developed a test designed
to identify psycholinguistic abilities and disabilities in children
between the ages of two and one-half and nine. Psycholinguistic
is a term used by the test authors to refer to those psychological
processes a child uses in acquiring language. The experimental
edition evaluated consisted of nine tests selected to pictorially
illustrate linguistic strengths and weaknesses in children. The test was
designed after a theoretical model of communication processes
adapted from Hull's Theory of Learning by Charles Osgood (1957).

The research conducted on this experimental edition to date, with
handicapped children, has revealed that this instrument has
the potential as a diagnostic instrument with the children used in
these research projects. Bateman (1963), Kass (1963), Kirk, Kass, and
Bateman (1962), McCarthy (1963), Olson (1963), Semmel and
Mueller (1962), Sievers (1963), Smith (1962), and Wisland and
Many (1967) have been largely responsible for the research in this
area, which included studies of cerebral palsied children, receptive
and expressive aphasics, deaf children, partially sighted, mentally
retarded, and intellectually superior children.
The purpose of the study reported here was to determine the
effectiveness of the ITPA with children who have above average
intelligence. As indicated above, this instrument has been positively
received by many workers in the field. Little is known, however,
of the effectiveness of this test among children on the upper
range of the intelligence scale.

Method


Subjects 


The subjects consisted of 97 children attending Northern Illinois 
University Laboratory and Nursery School, who demonstrated that 
their intelligence was above average through classroom performance 
and group intelligence tests routinely administered in the school 
program. Students who had not been tested because of their new- 
ness to the program or for other reasons, were given the Peabody 
Picture Vocabulary Test.

In addition to the intelligence tests, each child was also given the 
Massachusetts Vision Screening Test and an audiometric sweep
test to determine normalcy of vision and hearing. No child included
in the study had any observable defects and was consequently as- 
sumed to be “normal” except for his above average intelligence. 


Procedure 


Eight examiners were used to administer the tests using standard- 
ized procedures of administration given by the test authors. An in- 
ternship program was established to train them in administrative 
procedures. Each examiner was given a specific age group to
examine in the research project to assure competency within a specific
age range. 

Each child included in the study was examined twice by the same 
person with a two week period provided between each test to deter- 
mine a coefficient of stability as well as provide data for the factor 
analysis. 

Individual testing rooms were used with one-way mirrors and in- 
tercoms to make periodic observations of the testing activity. As- 
sistance from the research director was available to examiners at 
all times during the testing periods.

Intercorrelations of all variables were computed and factored by a
centroid method of factor analysis to provide an estimate of the factor
loadings. As Guilford (1954), Harman (1960), and Horst (1965) have
stated, this procedure corresponds fairly closely to other,
mathematically more rigorous factoring procedures.

TABLE 1
Correlation Matrix Showing Intercorrelations of Subtests

Variables*     1     2     3     4     5     6     7     8     9
1.                 .639  .556  .681  .611  .502  .556  .469  .444
2.           .639        .686  .604  .568  .611  .391  .613  .466
3.           .556  .686        .608  .503  .551  .409  .611  .478
4.           .681  .604  .608        .662  .573  .556  .656  .559
5.           .611  .568  .503  .662        .397  .506  .557  .482
6.           .502  .611  .551  .573  .397        .265  .507  .441
7.           .556  .391  .409  .556  .506  .265        .484  .600
8.           .469  .613  .611  .656  .557  .507  .484        .528
9.           .444  .466  .478  .559  .482  .441  .600  .528

* Subtests of the ITPA
1. Auditory Decoding              6. Motor Encoding
2. Visual Decoding                7. Auditory-Vocal Automatic
3. Auditory-Vocal Association     8. Auditory-Vocal Sequencing
4. Visual-Motor Association       9. Visual-Motor Sequencing
5. Vocal Encoding

The centroid factor matrix given in Table 2 shows the factor
loadings on each subtest of the ITPA. Table 3 shows the final
rotated matrix. An orthogonal rotation was used to establish
Guilford's criteria of simple structure (maximal number of zero factor
loadings) and positive manifold, in that these were ability vectors
and, consequently, assumed to be positively related to each other.
As the reader can see, the sums of the squared loadings show that the
first factor accounts for 59 per cent of the total common-factor
variance. Factor three is next in order of importance in that it
accounts for 13 per cent. The remaining seven factors combined
account for only 28 per cent of the total common-factor variance,
and they do not appear to make a significant contribution
to the test results.
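
The percentages quoted above come from sums of squared loadings. As a minimal sketch of that computation, the Python function below divides each factor's sum of squared loadings by the total over all factors. The loading matrix is a small hypothetical example, not the matrix of Table 3, and the function name is an assumption introduced here.

```python
def common_variance_shares(loadings):
    """Share of total common-factor variance carried by each factor.

    `loadings` is a list of rows, one per variable, with one loading per factor.
    """
    n_factors = len(loadings[0])
    column_ss = [sum(row[j] ** 2 for row in loadings) for j in range(n_factors)]
    total = sum(column_ss)
    return [ss / total for ss in column_ss]


# Hypothetical loadings for three variables on two factors.
loadings = [
    [0.8, 0.1],
    [0.7, 0.3],
    [0.6, -0.2],
]
shares = common_variance_shares(loadings)
print(shares)
```

The shares sum to one by construction, so a factor's percentage can be read directly as its contribution relative to all extracted factors.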
Table 4 is a transformation matrix which has been included to
provide information for converting the centroid matrix into the
final rotated matrix. Thus, for verification, the centroid matrix
multiplied by the transformation matrix will give the final rotated
matrix.
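
The verification described here is a single matrix product. The sketch below shows the step in plain Python on a hypothetical two-factor example; the matrices are assumed values for illustration, not those of Tables 2 to 4. It also checks that the transformation is orthogonal, in which case each variable's communality is unchanged by rotation.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    return [list(col) for col in zip(*m)]


def is_orthogonal(t, tol=1e-9):
    """True when t multiplied by its transpose is the identity matrix."""
    product = matmul(t, transpose(t))
    n = len(t)
    return all(
        abs(product[i][j] - (1.0 if i == j else 0.0)) < tol
        for i in range(n)
        for j in range(n)
    )


# Hypothetical unrotated loadings and an orthogonal transformation matrix.
centroid = [[0.5, 0.5], [0.7, -0.1]]
transformation = [[0.6, -0.8], [0.8, 0.6]]

rotated = matmul(centroid, transformation)
communalities_before = [sum(v ** 2 for v in row) for row in centroid]
communalities_after = [sum(v ** 2 for v in row) for row in rotated]
print(rotated, is_orthogonal(transformation))
```

Comparing the product with the published rotated matrix, cell by cell, is the check the authors describe.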


Discussion 
It appears from the present data that Kirk and McCarthy have
little support for justifying the hypothetical construct of
nine factors in their test when used with a population similar to the
one used in this study. An attempt was made during the orthogonal
rotation of the factor matrix to isolate nine factors. As Table 4
indicates, there appear to be two factors that are contributing most
TABLE 4

Transformation Matrix

Factors 
Variables 1 2 3 4 5 6 7 8 9 
1 18  .423  .357  .207 .208  .149 253 .129 Il 
2 -335 —.907  .167 .097 .097 .070 118 000 052 
3 «277  .000 —.900  .080 .230 .203 098 048 052 
4. -294  .000 .026 —.809 102 .078 —.288 046  .08 
5 267 .000 —.129 —.020 —.896 —.021  .083 001 3% 
6 278 .000 —.129 —.021 1012 —.915  .021 1109 — 313 
7. 190 .000 .000 .365 .008 —.010 —.903 019 12 
8.  .194 .000 .000 .000 .008 —.010  .003 —.970 —.135 
9. -148 .000 .000 .000 —.984 352 —.088 145 —.884 
SE :852 —.088  .146 BE 


of the meaning to the diagnostic profile of this test, factor one and 
three. Together, they make up 72 per cent of the total common- 
factor variance of the entire test. (See the last column of Table 3.) 

The first factor appears to be almost equally weighted on each of 
the nine subtests. For this reason, the writers are assuming that this 
factor represents a general psycholinguistic factor that is common 
to each test.

If one studies the hypothetical model of psycholinguistic abilities
proposed by Kirk and McCarthy (see Figure 1) it would be logical
to assume that if this theory actually works in practice the
identified factors would fit into the pattern proposed in the model. The
test authors have postulated that three major dimensions are needed
to specify a given psycholinguistic ability; they are levels of
organization, psycholinguistic processes, and channels of communication.
Each major dimension is then subdivided as discussed in the
following paragraphs.

Levels of organization describe the functional complexity of the
organism. Two levels are identified as being important for language
acquisition and use: (a) the representational level, which is
sufficiently organized to mediate activities requiring the meaning or
significance of linguistic symbols, and (b) the automatic-sequential
level, which mediates activities requiring the retention of linguistic
symbol sequences and the execution of automatic habit-chains. It is
their belief that normal acquisition and use of language depends
on both levels.

The second dimension includes psycholinguistic processes, which
encompasses the acquisition and use of the habits required for
normal language usage. There are three sets of habits considered here:
(1) decoding, the sum total of habits required to ultimately obtain
meaning from either visual or auditory linguistic stimuli, (2)
encoding, the sum total of habits required to ultimately express
oneself in words or gestures, and (3) association, the sum total of habits
required to manipulate linguistic symbols internally, that is, within
the central and peripheral nervous system.

[Figure 1. A model of psycholinguistic abilities. The diagram arranges
the nine subtests by process (decoding, association, encoding) and by
level (representational, automatic-sequential), running from auditory
and visual stimuli to motor and vocal responses.]

Representational Level             Automatic-Sequential Level
1. Auditory Decoding               7. Auditory-Vocal Automatic
2. Visual Decoding                 8. Auditory-Vocal Sequencing
3. Auditory-Vocal Association      9. Visual-Motor Sequencing
4. Visual-Motor Association
5. Vocal Encoding
6. Motor Encoding

Note. This model was taken from the Examiner's Manual of the Illinois Test
of Psycholinguistic Abilities, 1961, prepared by Kirk, Samuel A., and
McCarthy, James J.

The third dimension includes the channels of communication,
which describe the sensory-motor path over which linguistic
symbols are received and responded to. It is divided into the mode of
reception and the mode of response.

The column captions of Table 3 represent those areas of the model
described above (see Figure 1) that are included in the Illinois Test
of Psycholinguistic Abilities, including each subtest. As indicated
above, it appears logical to assume that if the test actually
represents the model in practice as well as in theory, the various factors
would tend to cluster within the columns they represent. As one
can see when examining Table 4, the loadings of the various
factors are not clustered, but are scattered with a few exceptions.
Factor two is loaded on sequential tests eight and nine, and thus may
be considered a factor of sequential skills. Factor six is also a
sequential factor that is largely concerned with visual-motor skills
involving the correct reproduction of a sequence of symbols
previously seen.

Factor eight does not cluster into one of the columns or major
levels of organization such as the representational level or
automatic-sequential level; however, it does appear to include decoding skills
involving auditory directions in which the subject must comprehend
the vocabulary he hears.

Factors three, four, five and seven do not appear to fit any pat- 
tern, or require any common activity or skill of the examinee, at 
least as far as the writers could determine. 

Factor nine is believed to be a residual factor in that it lost rather 
than gained variance in rotation, and ended up with no substantial 
factor loadings. Substantial is defined here as greater than .25 or .30
(2:500). It should also be noted that loadings of .10 or less should be 
regarded as zero for all practical purposes, because the experimental 
sample is considered as being small, which would of course increase 
the sampling errors and the resulting standard errors of the factor 
loadings (2:508). 


Summary 


The results of this factorial study of the Illinois Test of
Psycholinguistic Abilities revealed that there may be as many as nine
factors involved in this test. Only three of these factors, however,
should be considered as contributing much in the way of diagnostic
information in that they account for 79 per cent of the total
common-factor variance of the entire test. The remaining five factors
appear to be inconsequential in their significance.

An attempt to place the identified factors into the psycholinguistic
model proposed by Kirk and McCarthy (1961b) in their test
manual was not successful for the majority of the factors identified.
Factor one was considered to be a general psycholinguistic factor
because of its consistent loading on each of the nine subtests. Factor
two appeared to be a general sequencing factor, factor six a visual
motor sequencing factor, and factor eight an auditory factor
involving vocabulary activity. The remaining factors were not identified.

Although a ninth factor was identified in the centroid matrix, it 
was later considered as a residual factor after rotation because its 
loadings were in essence reduced to zero in the rotated matrix. 

The writers have thus concluded that the findings of this
research do not support the hypothetical construct proposed by the
authors, that this test contains nine distinct factors.


PSEUDOPROGRESSIVISM AND ASSESSMENT
OF TEACHER BEHAVIOR


ELAZAR J. PEDHAZUR 
New York University 


The scientific study of educational attitudes has a relatively
short and meager history. This is in contrast to the central position
of attitudes in social psychology (Allport, 1954; Katz and Stotland,
1959). While two large factors of educational attitudes have been
identified, namely “progressivism” and “traditionalism,” and while
subsets of these factors have been studied (Kerlinger, 1958, 1967),
educational attitudes have been, for the most part, taken at face
value. Except for some work in the area of fakability of educational
attitudes (e.g., Coleman, 1954; Stein and Hardy, 1957), there is no
evidence of attempts to study fundamental differences within sets
of individuals who, according to their responses to educational
attitude scales, were labeled progressives or traditionalists.
Ignoring the motivational bases of expressed attitudes, treating
attitudes which are phenotypically similar as if they were also
genotypically similar, resulted in conflicting findings and
meaningless predictions.
It has been shown that attitudes similar in content emanate from
different personality structures. Adorno, Frenkel-Brunswik, Levinson,
and Sanford (1950), for example, distinguished between
pseudoconservatism and what one may call “genuine” conservatism.
According to Adorno et al., there is little difference between the two
on the level of expressed conservative attitudes; but they
are fundamentally different in the underlying readiness for radical
change and in the democratic and anti-democratic tendencies.
Similarly, Allport (1966) distinguished between two religious
orientations, “extrinsic” and "intrinsic," and showed how these were
differently related to prejudice. The extrinsically religious pattern is
well illustrated in the story about an orthodox Jew who, during the
siege of Jerusalem, was engaged in prayer for the protection of the
city. Upon seeing a friend, he implored him to join him in prayer.
“Rely on God,” said the friend, “He'll protect us.”
"Don't rely on God,” retorted the first. “You better pray.”
Fromm (1941) speaks of pseudo thinking, pseudo feeling, and 
pseudo willing which lead to a pseudo self. His distinction between 
“freedom from” and “positive freedom” can be viewed as pseudo 
freedom and genuine freedom respectively. Fromm maintains that 
the criterion for regarding a statement as representing pseudo
thinking is not its logicality, as such, but rather the kind of motivational
forces behind the statement. "The decisive point is not what is
thought but how it is thought (Fromm, 1941, p. 95).”
In educational attitudes, too, one must consider the function 
served by the attitudes. Expressed progressivism, for example, may 
serve, for some individuals, an adjustive purpose. That is, an indi- 
vidual may express progressive attitudes as a result of converging 
on the norms of a school with which he is affiliated. He may choose 
to espouse progressive attitudes as a means of getting or
maintaining a position in schools known to be progressive.
Certain individuals, moreover, may have a self-image of being
progressive and will therefore express progressive attitudes. Such
individuals derive satisfaction from the mere expression of attitudes
congruent with their self-image (Katz, 1960). Still other
individuals may express a progressive attitude because they believe in
progressivism and are committed to its precepts.
And then there are those who express progressive attitudes for
ego-defense. One of the mechanisms employed is probably reaction
formation. That is, “the development in the ego of conscious,
socialized attitudes which are the direct opposite of repressed wishes in the
unconscious (Blum, 1953, p. 107).”
The present investigation has attempted to differentiate between
two educational attitude patterns, namely pseudoprogressivism and
genuine progressivism, and has studied the relations between these
patterns and assessment of teacher behavior. It was hypothesized
that pseudoprogressives will assess teachers exhibiting seemingly
progressive behavior more positively than will genuine progressives.
It was believed that the genuine progressive student of education,
who has not had the experience of working independently for any
length of time with a class, may be more prone to be “taken in" by
seemingly progressive behavior than will a genuine progressive
teacher. It was, therefore, further hypothesized that students will
assess teachers exhibiting seemingly progressive behaviors more
positively than will teachers.

Man has long been aware of the effects of emotions, attitudes, and 
other inner states on cognitive processes. Evidence of this awareness 
can be found in the folklore, literature, philosophy, and art of all 
peoples. Only recently, however, has there been systematic scientific 
study of the influence of inner states on cognition. Among notable 
contributors toward an integrating theory are Bruner (1951) 
and Postman (1951). 

In the present investigation the inner states were educational
attitudes, more specifically, genuine progressivism and
pseudoprogressivism. As is the case with other philosophies of education that have
had a great impact on human thought and behavior, it is difficult to
find a definitive statement of progressivism that is supported by all
or even most philosophers or educators. According to Brubacher,
progressivism implies change, which in turn implies novelty.
Man is confronted with a world that undergoes change at
different rates at different times. Progressive education emphasizes,
therefore, the problem-solving attitude of mind, or initiative and
self-reliance. The cultivation of individual differences, the emphasis
on pupils' interests and needs, the approach to value and truth
through the concrete experience of the individual, and an
attachment to the democratic process and a pluralistic view of society are
among the basic principles of progressivism.
All too often one encounters teachers who profess to be
progressive but who, in word and in deed, betray attitudes which are
antithetical to the principles which they presumably uphold. It is,
perhaps, this inconsistency that may serve as a means for distinguishing
between pseudoprogressives and genuine progressives.

Rokeach (1960) called for a distinction between the content and
structure of an attitude. An individual may, for example, hold an
attitude of permissiveness toward children but be authoritarian
about this attitude and intolerant to anyone who disagrees with
him. The attitude patterns which one witnesses in the civil rights 
and peace movements and the wide variety of behaviors associated 
with these patterns are another illustration of the need to distin- 
guish between the content and structure of an attitude. 

The concept of dogmatism (Rokeach, 1954) presumably describes 
a relatively closed cognitive organization of beliefs and disbeliefs 
about absolute authority. This set of beliefs, in turn, provides a 
framework for patterns of intolerance and qualified tolerance to- 
ward others. Dogmatism is, therefore, antithetical to progressivism. 
Progressivism prescribes tolerance toward others, a cognitive
organization that is open to change, and “equality and warmth in
interpersonal relationships (Kerlinger, 1958, p. 112).”
Pseudoprogressivism can, therefore, be defined as a pattern of attitudes which are
progressive in content but dogmatic or closed in structure. Genuine 
progressivism, on the other hand, can be viewed as a pattern of at- 
titudes which are progressive in content and open in structure, that 
is, not dogmatic. 


Method 
Materials 


Progressivism was measured by Education Scale VII (Kerlinger, 
1967; Kerlinger and Pedhazur, 1967). Education Scale VII (ES- 
VII) is a 30-item, seven-point, summated rating scale consisting of 
15 traditionalist items and 15 progressive items. In the present in- 
vestigation only the 15 progressive items were used. The reliability 
(alpha) of this scale with a sample similar to the one under study 
was .76 (Kerlinger and Pedhazur, 1967). Dogmatism was measured 
by the Dogmatism Scale (Rokeach, 1960). 

Operationally, a pseudoprogressive is a subject whose scores on
the progressivism scale (ES-VII) and the Dogmatism Scale (D
Scale) are above the means of his group. A genuine progressive is a
subject whose score on ES-VII is above the mean of his group and
whose score on the D Scale is below the mean of his group.
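
The operational definitions amount to a two-way split at the group means. The following Python sketch expresses that rule; the function name, the labels, and the sample scores are hypothetical, and the handling of scores falling exactly at a mean (left unclassified here) is an assumption, since the text does not address that case.

```python
def classify_subjects(es7_scores, d_scores):
    """Label each subject from ES-VII and D Scale scores relative to the group means.

    Returns "pseudoprogressive", "genuine progressive", or None for subjects
    who are not above the ES-VII mean or who fall exactly at the D Scale mean.
    """
    mean_es7 = sum(es7_scores) / len(es7_scores)
    mean_d = sum(d_scores) / len(d_scores)
    labels = []
    for es7, d in zip(es7_scores, d_scores):
        if es7 > mean_es7 and d > mean_d:
            labels.append("pseudoprogressive")
        elif es7 > mean_es7 and d < mean_d:
            labels.append("genuine progressive")
        else:
            labels.append(None)
    return labels


# Hypothetical scores for four subjects in one group.
es7 = [60, 80, 90, 70]
dogmatism = [100, 150, 110, 160]
labels = classify_subjects(es7, dogmatism)
print(labels)
```

Because the means are computed within the group passed in, teachers and students would be classified separately under this reading of "his group."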

The dependent variable, assessment of teacher behavior, was
measured by Teachers At Work Scale (TAW) constructed by the
author. TAW consists of six episodes which describe teacher-pupil
interactions. All the episodes were meant to appeal to
pseudoprogressives in that they are seemingly progressive; that is, they
appear to be progressive and employ the mechanics of progressivism,
but their latent meaning is contradictory to progressivism. In each
episode, the teacher exhibits some or all of the following behaviors:
manipulates the students, encourages destructive criticism,
intragroup aggression, competition, confession, and the like.

Three episodes were adapted from Miel (1952). One of these epi- 
sodes which deals with a teacher's planning is described by Miel as 
an example of “the stifling kind of preplanning that is bound to vio- 
late the very aim she states, use of ‘cooperative planning and the 
democratic process’ (p. 278).” In another episode, which describes 
a weekly class meeting, "the teacher plays such a dominant role
that at every critical point the leader is robbed of the opportunity 
to make a judgment. This may be called an autocratic type of 
teacher guidance of a pupil leader (Miel, 1952, p. 345).” 

Two episodes were adapted from Henry (1957). In one of the
episodes the teacher calls upon each student to relate his good and
bad deeds of the week. Classmates are called upon to testify to the
truthfulness of the statements while the teacher records them in the
student's booklet entitled “All About Me.” Henry points out that
the teacher involved probably thought she was teaching the pupils
to be honest and upright, but that her unconscious tendencies caused
these aims to be wrongly expressed. Of the other episode,
in which students read their stories and classmates criticize them,
Henry notes that creating stories and discussing them by the
class is in keeping with the principles of progressivism. In the episode,
however, “the teacher's own (at times unconscious) need to carp and
criticize gets in the way of her adequately developing the creative
and supportive possibilities in her charges (Henry, 1957, p. 133)."

ex ilits « ode, written by the present author, depicts a teacher who 
| ings of a Recreo lack of understanding and regard for the feel- 
Ens boy, "d p 18 rejected by his classmates. In an effort to help 

criticize — s the situation by encouraging all ia 

| Boe Were preceded by a statement that they were taken 

act as a mente observations. The respondent was asked to 

episode ong "s "M Observer and rate each teacher involved in each 
“point scale from very poor to excellent? 


For a 
m i soar 
Pedhazur (logy) detailed description of the scales used and sample scales see 
The reliability of TAW (coefficient of stability with a two-week
interval) was .82 for teachers (N = 72) and .80 for students
(N = 87).


Subjects 


One hundred fifty-nine teachers and 174 students from the New 
York Metropolitan area whose scores were above the mean on ES- 
VII were retained for analysis. 


Results 


The means and standard deviations of the subsamples on ES-VII,
the D Scale, and TAW are reported in Table 1. Note that the
differences between the means on the D Scale of the genuine
progressive and pseudoprogressive groups are considerable (over 2.5
standard deviations). The difference between the means on TAW
of genuine progressive and pseudoprogressive teachers is nearly one
standard deviation and that of genuine progressive and
pseudoprogressive students about .75 standard deviations.

The TAW scores were subjected to a 2 x 2 factorial analysis of 
variance using attitudes (pseudoprogressivism and genuine pro- 
gressivism) as one independent variable and teaching experience 
(teachers and students of education) as the other independent
variable. Since the frequencies in the cells are unequal, a harmonic mean
transformation was performed (Winer, 1962, pp. 241-244). 
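
The adjustment for unequal cell frequencies mentioned above rests on the harmonic mean of the cell sizes. The following is only a minimal sketch of that single step, not the procedure as carried out in the study; the function name and the example cell sizes are hypothetical.

```python
def harmonic_mean_cell_size(cell_sizes):
    """Harmonic mean of the cell frequencies: k / sum(1 / n_i)."""
    if not cell_sizes or any(n <= 0 for n in cell_sizes):
        raise ValueError("cell sizes must be positive")
    return len(cell_sizes) / sum(1.0 / n for n in cell_sizes)


# Hypothetical 2 x 2 design with unequal cell frequencies
# (illustrative values, not the frequencies of the present study).
example_sizes = [10, 20, 20, 40]
n_h = harmonic_mean_cell_size(example_sizes)
```

In an unweighted-means analysis this single value stands in for the common cell size when the sums of squares for the effects are computed from the cell means.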

In Table 2 will be found the summary of the factorial analysis of
variance of the TAW scores. The means of 14.47 and 17.52 of genuine
progressives and pseudoprogressives respectively are significantly
different at the .001 level (F = 51.04, df 1/329). The
hypothesis that pseudoprogressives will assess more positively
teachers exhibiting seemingly progressive behavior is supported. The
mean difference between genuine progressives and pseudoprogressives
is about .75 standard deviations.


TABLE 1
Means and Standard Deviations of Subsamples on ES-VII, D Scale, and TAW

                  Teachers                     Students
            Genuine       Pseudo-        Genuine       Pseudo-
          progressive   progressive    progressive   progressive

N             78            86             94            80

ES-VII       6.22          6.13           6.09          6.09


TABLE 2
Analysis of Variance of TAW Scores

Source                   df       MS         F
Genuine vs. Pseudo        1     818.70     51.04**
Teachers vs. Students     1     112.24      6.99*
Interaction               1       6.60      <1
Within Cells            329      16.04

*p < .01.
**p < .001.


There is also a significant difference between teachers and
students (F = 6.99, df 1/329, p < .01) indicating that teaching
experience may have an effect on the assessment of teacher behavior. The
magnitude of the difference between the means, however, is about
2% standard deviations. The difference between the effect of the 
attitude variable and the teaching experience variable is also
indicated by the estimated ω² (Hays, 1963, p. 407), being .15 and .02
respectively.
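
The F ratios and the strength-of-effect estimates can be checked against the mean squares printed in Table 2. The sketch below is an illustration under stated assumptions, not a reconstruction of the author's computation: the sums of squares are treated as additive, and the fixed-effects form of the ω² estimate is assumed. All variable and function names are hypothetical.

```python
# Mean squares and degrees of freedom as printed in Table 2.
ms = {"attitude": 818.70, "experience": 112.24, "interaction": 6.60}
df = {"attitude": 1, "experience": 1, "interaction": 1}
ms_within, df_within = 16.04, 329

# F ratio for each effect: effect mean square over within-cells mean square.
f_ratio = {name: ms[name] / ms_within for name in ms}

# Assumption: SS = MS * df and the sums of squares add up to the total,
# which holds only approximately for an unweighted-means analysis.
ss = {name: ms[name] * df[name] for name in ms}
ss_total = sum(ss.values()) + ms_within * df_within


def omega_squared(effect):
    """Estimated omega squared for a fixed effect (assumed formula)."""
    return (ss[effect] - df[effect] * ms_within) / (ss_total + ms_within)
```

Under these assumptions the F ratios agree with Table 2 to within rounding, and the ω² estimates come out near .13 for attitude and .02 for teaching experience. The attitude value is somewhat below the reported .15, which may reflect a different total sum of squares in the original analysis.
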
The degree of relationship between the variables was studied via
r. In Table 3 are reported the correlations between ES-VII,
D, and TAW for teachers and students of education. While
the correlations between ES-VII and D and between ES-VII and
TAW hover around zero for both groups, the correlation between
the D Scale and TAW for teachers is .45 (p < .001), accounting
for 20 per cent of the variance in TAW; and the correlation
between the D Scale and TAW for students is .40 (p < .001),
accounting for 16 per cent of the variance in TAW. The lack of
correlation between ES-VII and the D Scale was a basic premise in
defining the measure of the independent variable. The lack of
correlation between ES-VII and TAW indicates that knowledge of the
content of an individual's progressive attitudes will not enable one
to predict how this individual will assess seemingly progressive
behaviors of [...]

TABLE 3
Correlations between ES-VII, D Scale, and TAW for Teachers and Students

              Teachers (N = 159)          Students (N = 174)

           ES-VII     D      TAW       ES-VII     D      TAW
ES-VII      1.00    -.09    -.06        1.00     .00    -.05
D                   1.00     .45*                1.00    .40*
TAW                         1.00                        1.00

*p < .001.
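
The proportions of variance quoted for the D Scale and TAW correlations are simply the squared coefficients, and the significance of each r can be gauged with the usual t statistic on n - 2 degrees of freedom. A minimal sketch follows, assuming the N values shown in Table 3; the function names are hypothetical.

```python
import math


def variance_explained(r):
    """Proportion of variance in one variable accounted for by the other."""
    return r * r


def t_for_r(r, n):
    """t statistic for testing rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Correlations between the D Scale and TAW, as given in Table 3.
teachers = {"r": 0.45, "n": 159}
students = {"r": 0.40, "n": 174}

teachers_r2 = variance_explained(teachers["r"])
students_r2 = variance_explained(students["r"])
teachers_t = t_for_r(teachers["r"], teachers["n"])
students_t = t_for_r(students["r"], students["n"])
```

The squared coefficients are about .20 and .16, in line with the percentages of variance cited in the text.
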
Discussion 


A few keen observers have pointed out that a great deal of what
passes for progressivism has no resemblance to it and, in many
cases, is diametrically opposed to it. Riesman, Glazer, and Denney
(1950), for example, argue that “educational methods that were 
once liberating may even tend to thwart individuality rather than 
advance and protect it (p. 60).” Similarly, Fromm (1947) observes 
that the good ideas of progressivism were perverted by parents and 
educators. Overt authority, Fromm says, has given way to anony- 
mous authority, which in many ways can be more oppressive. 

The present investigation was concerned with distinguishing be- 
tween pseudoprogressivism and genuine progressivism and studied 
the relations between these patterns and assessment of teacher
behavior. It was found that pseudoprogressives assess teachers
exhibiting seemingly progressive behavior more positively than do
genuine progressives. It was further found that students of education
assess seemingly progressive behavior more positively than do
teachers.

One should, therefore, not speak about abuses of progressivism 
per se but rather study the kind of person who is prone to do so. In
the present investigation, paper and pencil instruments were used.
It is important to study pseudoprogressivism as it is being
manifested in actual behavior in the classroom. Anyone who is familiar
with the educational process has probably encountered the
pseudoprogressive teacher for whom the means of progressivism have
become an end, the teacher who employs the means of progressivism
without any relation to its end, the teacher who espouses
progressive education and yet manipulates and dominates his students.

Hovland and Sherif (1952) have suggested using an individual's 
judgments in order to make an inference about his attitudes. In the
context of the present study, it is believed that on the basis of a
subject's assessment of teachers one can discern the kind of
behaviors the subject endorses and, perhaps, further project that the
subject will act in a similar manner in the classroom. If this line of 
reasoning is correct, one should be apprehensive of the deleterious 
consequences of entrusting a group of students to a pseudoprogres- 
sive teacher. 
What values and attitudes does a teacher, who consciously claims
to be what he is not, impart to his students? What kind of
interpersonal relationship develops in a class whose teacher’s
pseudoprogressivism is unconscious? These are some of the questions which
should not be left unanswered by anyone who is concerned with the 
educational process. 

The study of attitudes and their relations to cognitive processes 
and behavior can be enhanced by investigating the phenomenon of 
pseudoism. The definition of pseudoism, its function, and the means
of measuring it will require a great deal of work. The present inves- 
tigation can serve only as an example of an attempt in this direction. 
This attempt was largely based on the distinction between the con- 
tent and the structure of educational attitudes, more specifically, 
[...]ed attitudes. Other attitude domains and other approaches
to the study of pseudoism should prove to be useful.
IMPLICATIONS OF THE STRUCTURE-OF-INTELLECT
MODEL FOR SELECTION AND PLACEMENT OF
COLLEGE STUDENTS¹


WILLIAM B. MICHAEL 
University of Southern California 


Although the basic properties of the structure-of-intellect (SI)
model have been reported frequently (e.g., Guilford, 1963, 1967;
Guilford & Hoepfner, 1966), it appears that its implications for the
selection and placement of college students have not been given
explicit or, at least, formal consideration in the professional
literature, perhaps because the implications have seemed too obvious to
the [...] psychologists who have systematically contributed to
the [...]ication of the constructs of the model. In this paper, as was
[...] one recently published (W. Michael, 1965), the writer's
thesis is that within the social psychological context of a
well defined and relatively homogeneous college environment
in which the expectations and value-systems of faculty members,
students, and administrators are known and mutually shared,
the SI model affords the basis for the development and description
of [...] reliable criterion variance and thus the potential
means [...] of improvements in the effectiveness and in
the [...] or criterion-related validity of measures employed in
the selection and placement of college students. That this
opportunity exists for enhancing the criterion-related validity of both
traditional tests of convergent thinking and recently developed
tests of divergent thinking rests upon the potentialities that the SI
model allows for the development of a theory of learning as well as
for the [...] of a related theory of instruction with constructs
[...] with those of the SI model. Briefly stated, once
the process objectives of the numerous criterion variables
representing outcomes of diversified teaching-learning activities can be
described in terms of the constructs of the SI model and then
assessed through use of appropriate measures such as achievement
tests, laboratory projects, samples of written work, and other
products of student endeavor, appropriate steps can be taken to
design or to choose aptitude tests which sample the same process
objectives as those in the criterion measures. The obvious purpose of a
theory for learning and of one for instruction built around the
constructs of Guilford’s SI model is to furnish a common frame of
reference that permits (1) the description of psychological
characteristics of various types of teacher and student activities, (2) the
specification of educational objectives in familiar terms, and (3)
the assessment of students’ attainment of these objectives through
use of suitable achievement tests or other samples of students’
productive and creative efforts.

¹ This paper is based in part on a presentation given at a special
commemorative [...] session honoring J. P. Guilford, which was held
at the [...] meeting of the American Psychological Association,
Los Angeles.

Purpose. It will be the purpose of the writer (1) to describe the 
need for a social psychological orientation to the problem of
selection and placement of students prior to the use of the SI model in
any college setting, (2) to point out relationships of Guilford’s 
model to learning theory as well as to certain specific paradigms
that have been proposed to describe the teaching process, and (3) 
to emphasize the desirability of achieving through collaborative in- 
stitutional research efforts a certain degree of unity among theories 
of intellectual functioning, learning, and teaching if improvements 
in the criterion-related validity of devices for selection and place- 
ment of college students are to be realized. Implementation of the 
results of such institutional research can be expected to be successful 
only if the faculty members, students, and administrators feel a
personal involvement in the undertaking.

Need for a social psychological theory for selection and place- 
ment of college students. Recent contributions by Fishman (1962), 
W. Michael (1968), W. Michael and Boyer (1965), Stern (1962), 
and Thistlethwaite (1963) served to demonstrate either explicitly or
implicitly the importance of college environments relative to the
empirical validity of both commonly used and experimental
intellective and nonintellective predictors. Since Guilford’s model
portrays primarily intellectual activities instead of those associated
with dimensions of temperament, interest, and motivation, it is
essential that due allowance be made for moderating influences that
individual campus environments and climates may have upon both
the absolute and relative degree of importance of certain inferred
abilities portrayed by constructs of the model. In other words, there
are probably families of college environments which furnish sets of
mediating parameters to determine weights that could be assigned
to each of the constructs identified by cells of the familiar cubical
representation of the SI model or at least to particular layers of cells
in the cube.

That certain types of undergraduate college environments are 
more productive than are others of students who will eventually at- 
tain a Ph.D. degree was dramatically illustrated by Thistlethwaite 
(1963). After making appropriate adjustments in terms of supply 
(numbers) of National Merit Scholarship finalists attending
different colleges, he noted substantial differences in the Ph.D.
productivity index for undergraduates from various families, or types, of
colleges as well as differential patterns in the correlations of scales
of environmental characteristics with productivity of Ph.D.’s in the
natural sciences and in the arts, humanities, and social sciences. In
[...], the correlational results suggested that strikingly
different patterns of student and faculty cultures were related to
[...] of doctorates in the arts, humanities, and social sciences
on the one hand and in the natural sciences on the other. Thus, the
lesson to be learned from this investigation as well as from
Fishman (1962) and Stern (1962) is that there is a great
deal of specificity in value systems of the faculty and students in a
[...] or even in various colleges on the campus [...], a
specificity which suggests necessary adjustments
in the use of the SI model from one campus to another in the
development of tests and measures for selection and placement of
students.


As noted previously, the characteristics of the college environment
can be expected to influence the whole process of learning and
teaching and thus to modify the constructs that are involved in the
outcomes of [...] to be evaluated. To the extent that differences
exist in the constructs of the criterion variables reflecting scholastic
achievement, the constructs in measures employed for selection and
placement of college students must be correspondingly modified if
a high degree of criterion-related validity is to be attained.
In a pioneering and insightful social psychological theory,
Fishman (1962) presented nine models representing permutations of
intellective and nonintellective predictors with intellective and non- 
intellective criterion variables. These models, which also allow for 
similarities and differences in high school and college environments, 
hold implications of a highly promising nature for the use of the SI 
model in college selection and placement. Fishman’s extensive work 
led him to conclude that neither nonintellective predictors such as 
temperament, biographical, or study-habits inventories nor non- 
intellective criteria such as the level of overachievement or under- 
achievement, the extent of extracurricular participation, or per- 
sonality ratings are likely to assume any degree of practical 
importance in criterion-related validity studies for college students. 

In the instance of nonintellective predictors, Fishman (1962) re- 
ported that in a survey of 580 studies in college guidance and selec- 
tion the median multiple correlation between a familiar criterion of 
freshman grade point average and a combination of the two
customary predictors of high school grades and a standardized
scholastic aptitude test was .55 and that the addition of some type of
personality test to the composite yielded a median gain of only .05, a
finding which he attributed in part to the fact that high school 
grades customarily contain substantial amounts of nonintellective 
variance. With respect to the nonintellective criterion, Fishman 
stated that the traditional emphasis of college faculties upon the 
importance of intellective and academically oriented criteria may
be expected to predominate for many decades in the college setting.
If true, the SI model may anticipate a long life expectancy in its
potential role as a vehicle for improving college selection and
placement procedures.
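
The practical size of the gain reported by Fishman can be gauged by converting the multiple correlations to proportions of criterion variance. The sketch below is purely illustrative: it treats the median R of .55 and the median gain of .05 as if they described a single study, which medians across studies need not do, and the function name is hypothetical.

```python
def r_squared_gain(r_base, r_gain):
    """Variance explained before and after a gain in multiple R."""
    base = r_base ** 2
    augmented = (r_base + r_gain) ** 2
    return base, augmented, augmented - base


# Median values cited from Fishman (1962): R = .55, gain = .05.
base_r2, augmented_r2, delta_r2 = r_squared_gain(0.55, 0.05)
```

On this reading the two customary predictors account for roughly 30 per cent of the variance in freshman grade point average, and the added personality measure contributes a further 6 per cent or so.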

Despite his emphasis on measures reflecting intellectual activities,
Fishman would make substantial use of what he terms institutional
predictors and individual contingency moderators (unexpected and
unpredictable events such as illness, deaths in the family, or
accidents) in order to salvage prediction of college success in individual
cases. An instrument such as the College and University
Environment Scales (CUES) (Pace, 1962) can furnish empirical
assessments of institutional environments and can serve as predictors
which can be advantageously used in conjunction both with indices
of high school achievement and with measures of intellectual
activities derivable from Guilford’s model.
Similarities in high school and college environments would
probably lead to giving substantial weights both to high school grade
point averages and to those measures of constructs in the SI model 
that have been shown to be empirically related to available cri- 
terion measures. On the other hand, if the differences between high 
school environments and college environments were helpful to, 
unrelated to, or disruptive of college achievement, then high school 
grade point average would be given, respectively, positive, zero, or 
negative weights, and measures associated with the constructs of the 
Guilford model would be given, correspondingly, relatively small, 
moderate, or relatively large weights. Traditional multiple regres- 
sion techniques could be employed along with factor analytic ap- 
proaches to furnish relatively precise statistical bases for enhanc- 
ing the criterion-related validity of the intellective and institutional 
measures employed.

Relationships of Guilford's model to theories of learning and 
teaching. If increases in the criterion-related validity of measures 
used in selection and placement of college students are to be effected, 
it is also essential that improvements be made both in the descrip- 
tion of the psychological characteristics underlying the criterion 
variables employed and in the subsequent assessments of criterion
[...] relative to the constructs hypothesized and verified in the
SI model. In the instance of scholastic achievement this understanding
[...] through the development and application of psychological
[...] to the learning and teaching process.

Guilford [...] some time ago recognized that the SI model
represents [...] a general theory of learning primarily
[...] the achievement of information not only through
cognition [...] discovery, but also through the convergent and divergent
[...] thinking. Broader than the conventional
[...] theory which have been largely limited
[...] and figural content, Guilford’s model can readily
account for [...] content as well as for behavioral (social) content.
Although similarities of the SI model could be shown to classical
association [...] of learning and although interpretations could
be forced to demonstrate certain consistencies with theories
in which [...] is central, it would appear that [...]
(but not all) [...] activities inferred from Guilford's
[...] to contemporary theories of learning in which
[...] of problematical situations takes place.

Gage (1963, 1964) argued most persuasively for the development
of theories of teaching in harmony with his distinction that theories 
of learning are aimed toward description of ways in which the or- 
ganism learns, whereas theories of teaching are concerned with 
ways in which an individual influences an organism to learn. Simi- 
larly, Ryans (1963) stressed the need for a theory of teaching and 
presented a comprehensive information system model in which the 
teacher is an information processing organism. Ryans’ theory em- 
bodies: (a) both internal and external inputs, (b) the information 
processing operation itself including filtering, reintegration, deci- 
sion making, encoding, channeling, and control of initial informa- 
tion as well as of feedback information, (c) outputs of teacher 
information processing—the teacher behaviors, and (d) the prod- 
ucts of teacher behaviors, which are in fact the student behaviors. 

In many respects, the degree of resemblance between properties 
of Ryans' information system model and the characteristics of the 
informational theory of behavior underlying the new structure-of- 
intellect problem-solving (SIPS) model very recently described by 
Guilford and Tenopyr (1968) is surprisingly high. In this SIPS 
model, which is based on an analogy of human problem-solving 
activities to the information-processing operations of a computer 
as in cybernetics, the individual becomes aware of a problem; 
establishes a “search model”; selectively seeks information (often 
expressible in SI content categories) in his memory storage, 
his somatic states, and his environment as required; continues
to make successive cognitions and evaluations of initial, altered, 
or new inputs, which are being constantly monitored and modi- 
fied by a filtering process and feedback mechanisms; and event- 
ually—sometime after a large number of cycles—finds answers 
that may be described in terms of familiar SI product cate- 
gories. It may well be possible to develop a paradigm consisting of 
two SIPS models facing each other (one “search model” for the
teacher and one “search model” for the student) in which the
individual problem-solving activities of teachers and students as well as
their interactions and mutual feedback effects in the teaching-
learning enterprise could be portrayed schematically.

The behavioral content (social intelligence) dimension, which 
cuts across the five psychological operations and the six products, 
or restructured forms of information, arising from processing of
initial content information in the SI model, would probably play a
vital role in the explanation of student-teacher interactions and of
related empathic experiences. Already important efforts have been 
directed by Guilford, Merrifield, deMille, and O'Sullivan (1962) in 
the development of tests consisting of photographs, cartoons, line 
drawings, and silhouettes to represent cognition of social behavior 
relative to the six products of the model—tests which are novel for 
the non-verbal content of items portraying social situations. In 
these items, individuals are placed in situations of social interaction. 
The examinee attempts to predict from a pictorial representation of 
expressive movements, such as gestures, facial expressions, and gen- 
eral stance what the individuals pictured in the items are thinking, 
perceiving, feeling, desiring, or contemplating. The challenge of 
synthesizing both these social characteristics as well as the other 
three content dimensions of Guilford’s SI model with the corre- 
sponding components in the information system model proposed by 
Ryans, with the key parameters of two other informational models
which Ryans (1963) reported, and with Guilford's new SIPS model
is indeed an intriguing one.

Guilford’s SI and SIPS models would appear to offer important
[...] to a description of both teaching behaviors and student
[...] that other models do not contain, particularly from the
[...] of providing five meaningful operations (psychological
[...]), in terms of which measures of student achievement and
[...] can be devised. Moreover, his SI model and to
[...] his SIPS model afford the teacher a means for
classification of the [...] of materials presented to students as well as a
system for the [...] and categorization of their restructured or
newly [...] forms of information. These products have [...] arisen
after students have processed over a period of time the
[...] given them in their assignments, in classroom lectures
and discussions, or in achievement tests sampling course objectives.
Thus, [...] directed toward encouraging students
to employ appropriate intellectual operations for attaining
carefully [...] outcomes of instruction, there should be
improvements [...] in learning and teaching activities, but also in
the degree of reliability and validity with which both process and
subject matter objectives in the curriculum can be assessed.

Once the process and content objectives in college courses are
carefully defined and reliably measured, aptitude tests can be
constructed to reflect the same psychological activities as those
identified in learning-teaching experiences. Such aptitude tests should be
expected to yield relatively high criterion-related validity coeffi- 
cients and thus efficient means for selection and placement of stu- 
dents in appropriate courses of study provided, of course, that both 
professors and students can be sufficiently motivated to participate 
as subjects in institutional research studies. 

Advantages to achieving a unity in theories of intelligence,
learning, and teaching. The previous consideration of teaching and
learning theories has pointed to a basic unity that exists between 
them and the Guilford SI and SIPS models. Unless there is a cer- 
tain approximation to congruency in the psychological objectives 
in the teaching-learning process, with those found in measures em- 
ployed in selection and placement of students, the marked lack of 
progress during the past thirty years in raising the criterion-related 
validity of various modes of college admissions cannot be expected 
to be altered. In short, the prognostic validity of aptitude tests is
highly specific to the criterion of scholastic achievement employed. 

The highly specific nature of the criterion-related validity of ap- 
titude tests has been well illustrated in an investigation in which 
Hills (1955) reported the relationships of certain measures of hy- 
pothesized creative abilities to achievement of students in upper di- 
vision and graduate classes in mathematics at three institutions of 
higher learning. Through use of factor tests, he demonstrated that in 
each of two institutions the presence of certain hypothesized crea- 
tive abilities was evident when the approach in teaching stimulated 
students to employ these same hypothesized abilities in their
problem-solving activities. For the third situation in which the
instructional process did not encourage students to make use of any of the
abilities that had been hypothesized to be in the tests of creativity, 
these particular abilities were not identifiable or apparently relevant 
to course success. The results of Hills’ research suggested that the 
types of abilities which could be expected to be evidenced from use 
of factor tests depended upon the particular curriculum-criterion- 
institutional context in which learning took place and that the
existing variance in criterion variables could not be expected to reflect
creative abilities unless these abilities were involved in the
instructional-learning process itself. Thus, without a certain degree of unity
between the constructs in the instruments used in selection and
placement procedures and those constructs found in evaluation
measures of college experiences, a high degree of criterion-related
validity cannot be realized for indices of admission and placement.

Another illustration which points to the contribution which the 
SI model can make to the realization of unity in the evaluation 
process is apparent in the relationship of this model to the well 
known taxonomy of educational objectives prepared by Bloom, 
Engelhart, Furst, Hill, and Krathwohl (1956), the terminology of 
which resembles the categories in the operations dimension of the 
Guilford SI model. Although the writer (W. Michael, 1957) sug- 
gested some time ago that systematic efforts should be directed to- 
ward synthesizing these two approaches in the evaluation of 
achievement of students and although J. Michael (1968) outlined in 
some detail how such a goal could be realized, the unpublished study 
by Schmadel (1960) is one of the few empirical efforts known to the 
writer. In her sample of school children, Schmadel demonstrated
that criterion measures of complex achievement tasks which
emphasized the employment of higher level cognitive processes of
synthesis and evaluation in the taxonomy proposed by Bloom et al.
could be predicted with a higher degree of accuracy from those
measures of hypothesized creative thinking abilities in the SI model
[...] concerned with sensitivity to problems, conceptual foresight,
[...] fluency, and originality than from well known [...] tests of
achievement and general intelligence, in which [...] with convergent
production were represented. This finding of a relationship in the
psychological processes involved in achievement measures built
around the constructs of the taxonomy with those processes in tests
developed in terms of the SI model once again points up not only
certain [...] similarities in the psychological operations involved
in the two [...] but also the potentialities that exist for enhancing
[...] of college achievement, if the psychological processes
underlying the attainment of instructional objectives in learning
[...] as reflected in the achievement measures can be made as [...]
congruent as possible with the psychological functions [...]
aptitude tests.

Summary. An attempt has been made to show how the SI model
serving as an informational theory of behavior along with the new
SIPS model can be employed as a unifying agent in relating the
constructs of the teaching-learning process to potentially the same 
constructs in the measures used in selection and placement of col- 
lege students as well as in the measures of subsequent scholastic 
achievement and creative endeavor of these students. Within the 
setting of a social psychology of learning in which the mediating in- 
fluences of college environments upon the achievement of students 
are clearly recognized, the SI model and its derivative the SIPS 
model afford a means for defining relative to a specific college en- 
vironment common psychological processes that are reflected in in- 
structional and learning experiences, in admissions and placement 
indices, and in evaluation measures such as achievement examina- 
tions and course assignments. The implementation of a research 
program aimed at increasing the criterion-related validity of selec- 
tion and placement devices through use of the SI and SIPS models 
or of any other workable theory depends upon the coordinated ef- 
forts of a research center on the college campus with which college 
professors, students, and administrators have positive feelings of 
identification and involvement. 
A major concern of psychometrists, guidance counselors, classroom
teachers and other school personnel in the planning of testing
programs has been the selection of an achievement battery adequately
measuring the outcomes of instruction. Classroom teachers in
particular often complain about the inadequacy of a test battery
with respect to their particular subject area. It is the purpose of
this study to demonstrate the justification of the concern alluded
to above and to show the results of an attempt to solve the problem.

The study was conducted with the seventh grade students of an
experimental school in South Florida. The subjects (N = 355) had a
mean School and College Ability Test (SCAT) score of 74. The test,
Abstract Reasoning (AR), from the Differential Aptitude Tests was
given to produce a strictly non-verbal measure of ability. The
Metropolitan Achievement Test (MAT) was the traditional achievement
battery administered in the county and was given to the subjects in
the experimental school. Various other achievement batteries were
examined in great detail by the teachers of the school, who then
selected the Stanford Achievement Test (SAT) as being the published
achievement test battery most closely approximating the objectives
of instruction in their respective
1 
apa rere seed while senior author was at Florida State University. 

evelopment Center in Educational Stimulation. 




404 EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 


content areas. For the outcomes of instruction not represented in the 
SAT, supplementary experimental tests were assembled by the 
teachers from existing pools of items which had proved satisfactory 
for other groups. 

Table 1 presents means and standard deviations for the tests 
given. 

Reasons for factor analysis. A factor analysis of the 24 variables 
listed in Table 1 was undertaken to determine: (1) whether suc- 
cess on MAT scores is related to success on SAT scores, (2) the ex- 
tent to which success in specific subject areas in either achievement 
battery is related to general ability, (3) whether the supplementary 
tests do in fact measure outcomes not provided for in the standard 
achievement batteries. 

Factor analysis procedure. Since the reliabilities of the supple- 
mentary tests were relatively low, communality estimates of unity 
were clearly not called for. When squared multiple R’s (SMR’s) 


TABLE 1
NOVA 7th Grade
(N = 355)

Variables                                          Means      SD
 1. MAT Word Knowledge                             35.50    10.84
 2. MAT Reading                                    27.04     9.87
 3. MAT Spelling                                   33.57    11.79
 4. MAT Language Study Skills                       8.21     5.75
 5. MAT Arithmetic Computation                     25.19     9.47
 6. MAT Arithmetic Problem Solving                 27.27     9.07
 7. MAT Social Studies Information                 30.21    11.02
 8. MAT Social Studies Study Skills                22.17     7.61
 9. MAT Science                                    30.15    10.23
10. SAT Social Studies                             50.23    12.48
11. SAT Language                                   95.71    15.88
12. SAT Science                                    31.25     9.18
13. SAT Paragraph Meaning                          32.73    10.92
14. SAT Spelling                                   27.73     9.82
15. SAT Arithmetic Computation                     16.70    11.31
16. SAT Arithmetic Concepts                        20.25    11.24
17. SAT Arithmetic Application                     14.33     9.22
18. Supplementary Experimental English             17.00     5.79
19. Supplementary Experimental Social Studies      18.73     7.04
20. Supplementary Experimental Math                 9.78     …
21. Supplementary Experimental Science             18.71     …
22. SCAT Verbal                                    42.48    11.27
23. SCAT Quantitative                              31.62     9.84
24. Abstract Reasoning                             30.81     9.10

* Variable numbers in Tables 2 and 3 correspond to the variable numbers of this table.
TABLE 2
First Seven Components for NOVA 7th Grade


Variables I II III IV V VI VII h²


T 81 24 o6 —31 —13 —19 04 87 
2 86 16 o -4 -08 -l 04 8 
3 68 16 27 13  —40 29 —02 79 
4 75 18 24 08  —05 10 —33 78 
5 72 05 26 35 21 10 28 %4 
6 75 08 26 32 28 01 17 8 
7 79 20 17 05 01 0 -1 72 
8 80 16 16 12 06 04 31 8 
9 73 16 18 14 26 | —05 14 70 
10 84  —03 —32  —05  —06 2 -04 8 
1 75  —09  —30 0  —13 18 02 "71 
12 80 -04 -33 —O07 16 -04 -02 78 
13 T4 —07 9-20) 00:42 0E 0T 04 n e 
14 66  —14  -19 05  —38 29 26 79 
15 44  —80 2 -n  -08 03 0 9 
16 48 —78 25  —10 o 4-04 —06 92 
17 43 —80 26  —13 o2 -07 -00 9 
18 755 —02 -33 —04 -0 07  —06 68 
19 722 -0 -30 02 10 08  -07 -63 
20 33 -15  -2 55  —39  —57 03 96 
21 65  —09 —36 03 20 00 01 60 
22 81 27 i0 -36 -07 -20 05 92 
23 81 14 17 —09 10 —12 15 76 
24 55 —-ll  —28 1 19 —-14 -30 55 


* Variable numbers on this table correspond to those of Table 1, where the variable names
are listed.


were used, a non-Gramian matrix resulted. However, when the
SMR’s were used as communality estimates, loadings greater than
.2 in the unrotated factor matrix were found in the first seven
columns. Therefore, it seemed reasonable to assume that there were
approximately seven significant factors. Unities were placed in the
diagonals and seven factors were rotated. The resulting communalities
for each variable were then used as communality estimates in
the original intercorrelation matrix, which was then Gramian. The
unrotated, principal axes factor solution for this matrix is presented
in Table 2.
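
The squared multiple correlation used above as a communality estimate has a standard closed form: for variable i it equals 1 - 1/r^ii, where r^ii is the i-th diagonal element of the inverse of the correlation matrix. The sketch below illustrates that formula only. The three-variable correlation matrix is hypothetical (it is not the study's data), and the function names and the small pure-Python inverse are assumptions made for illustration.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    # Augment the matrix with the identity: [R | I].
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def squared_multiple_correlations(corr):
    """SMC of each variable with all the others: 1 - 1 / (R^-1)_ii."""
    inverse = invert(corr)
    return [1.0 - 1.0 / inverse[i][i] for i in range(len(corr))]


# Hypothetical correlation matrix: three tests that all intercorrelate .50.
corr = [[1.0, 0.5, 0.5],
        [0.5, 1.0, 0.5],
        [0.5, 0.5, 1.0]]
smc = squared_multiple_correlations(corr)
print([round(value, 3) for value in smc])
```

Placing these values in the diagonal of the correlation matrix, instead of unities, is what the text refers to as using SMR's as communality estimates.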

The Varimax method of rotation was used to obtain the orthogonal
factor structure presented in Table 3.

Interpretation of results. One strong, pervasive factor loaded to a
substantial degree on 19 of the 24 tests (Factor I). High loadings
on AR and on the SAT and supplementary social studies tests,
which emphasize judgment and process, suggest that this factor may
TABLE 3
Varimax-Rotated Principal Axes Factor Solution
(NOVA 7th Grade)

Variables I II III IV V VI VII h²
1 38  -—07 11 07 —083 —76 -35 88 
2 46  —13 18 05 —12 -67 -3 84 
3 14 —09 20 09 —50 —30 —62 8 
4 28 —il 22 ol —06 -30 -7⁄4 79 
5 25  —14 77 o  —20 Á —21 -29 86 
6 28 | —14 74 08  —04 -25 -35 85 
7 37 —09 27 08 02 —38 —62 79 
8 38  —10 29 04 02  —30 4-67 8 
9 37  -09 59 06 12  —33 -36 75 

10 78  —13 17 0 -14 -23 -27 & 
1 67  —16 ll 07 -35 -21 -23 78 
12 76  —14 19 07 04  -32 -17 77 
13 68  —17 1 08  —19  —34 -14 70 
14 50  —19 12 11 -64 -—18 -16 79 
15 15  —93 11 Ol -14 -07 -06 98 
16 19 ~92 09 05 -03 -08 -10 9 
17 17 —9%4 06 05 00  —o8  —o8 93 
18 70  —10 08 08  -17 -?7 -24 68 
19 70 -n 17 0 —060 -19 -%4 63 
20 19  —09 09 95 -08 -03 -08 97 
21 "1 -1 21 08 01  —16  —09 61 
22 35  —06 14 0 —07 -s -3 9 
23 29  —14 44 00 -16 —64 -%4 88 
24 53  —13 09 15 12 -12 -18 39 


* Variable numbers on this table correspond to those of Table 1, where the variable names
are listed.


be associated with general reasoning ability. The prevalence of this
factor in the SAT and supplementary tests and the relatively lower
loadings on the MAT sub-tests suggest that process, judgment, and
reasoning ability are considerably more important for success on
these tests than on the MAT. This outcome agrees with the a priori
judgments regarding the nature of the two achievement batteries.
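
For readers unfamiliar with the rotation behind Table 3, the following is a minimal sketch of a varimax rotation carried out by successive planar rotations of factor pairs. It uses the raw criterion without row normalization, which may differ from the program the authors ran, and the two-factor loading matrix is hypothetical rather than taken from Table 2.

```python
import math


def varimax_criterion(loadings):
    """Sum over factors of the variance of the squared loadings."""
    n = len(loadings)
    total = 0.0
    for k in range(len(loadings[0])):
        squares = [row[k] ** 2 for row in loadings]
        mean = sum(squares) / n
        total += sum((s - mean) ** 2 for s in squares) / n
    return total


def varimax(loadings, sweeps=50, tol=1e-10):
    """Rotate each pair of factors by the angle that maximizes the criterion."""
    a = [list(row) for row in loadings]
    n, m = len(a), len(a[0])
    for _ in range(sweeps):
        largest = 0.0
        for p in range(m - 1):
            for q in range(p + 1, m):
                u = [row[p] ** 2 - row[q] ** 2 for row in a]
                v = [2.0 * row[p] * row[q] for row in a]
                sum_u, sum_v = sum(u), sum(v)
                c = sum(ui * ui - vi * vi for ui, vi in zip(u, v))
                d = 2.0 * sum(ui * vi for ui, vi in zip(u, v))
                # Closed-form optimal angle for this pair of factors.
                angle = 0.25 * math.atan2(d - 2.0 * sum_u * sum_v / n,
                                          c - (sum_u ** 2 - sum_v ** 2) / n)
                cos_t, sin_t = math.cos(angle), math.sin(angle)
                for row in a:
                    x, y = row[p], row[q]
                    row[p] = x * cos_t + y * sin_t
                    row[q] = -x * sin_t + y * cos_t
                largest = max(largest, abs(angle))
        if largest < tol:  # no pair needed a meaningful rotation
            break
    return a


# Hypothetical unrotated loadings: a general factor plus a bipolar factor.
unrotated = [[0.7, 0.5], [0.8, 0.4], [0.6, 0.6],
             [0.6, -0.5], [0.7, -0.4], [0.5, -0.6]]
rotated = varimax(unrotated)
print(varimax_criterion(unrotated), varimax_criterion(rotated))
```

Because the rotation is orthogonal, each variable's communality is unchanged; only the distribution of its variance across factors moves toward a few large and many near-zero loadings.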

Another general factor (Factor VI) loaded significantly on 16 of
the 24 tests, but primarily on the MAT and SCAT. High loadings on
the MAT word knowledge test and the SCAT verbal test suggest that
this factor represents experience-related verbal ability. The absence
of loadings on this factor for the SAT and supplementary tests further
confirms a priori judgments made about these tests to the effect
that application and process are emphasized over experience-based
knowledge, in line with desired outcomes of instruction. A
general factor (Factor VII) was very similar to Factor VI and further
confirms the belief of the teachers involved that success on the
MAT is strongly related to experience-based verbal ability. That
both these factors (Factors VI and VII) also loaded strongly on
SCAT suggests that these ability measures are strongly biased in
favor of the student with high experience-based verbal ability.

One factor interpretable in terms of a specific subject area was
Factor V, which loaded heavily on both spelling tests (Variables 3
and 14). Specific factors relating to mathematics were also identified.
Each mathematics test was associated with a separate factor:
MAT mathematics, Factor III; SAT mathematics, Factor II; and
supplementary mathematics test, Factor IV. Thus, it can clearly be
stated that characteristics resulting in success in this area are not
the same for each test. A clue to the nature of these differences is
obtained from noting that Factor III, the MAT mathematics factor,
also shows loadings on variables 9 and 23, the MAT science
test and the SCAT quantitative ability test. Thus it seems reasonable
to assert that the MAT mathematics factor is a broad ability-based
factor related to a mathematics-science orientation. On the
other hand, the SAT mathematics factor is exceptionally pure, or at
least unique, having no significant loadings on other tests. The
supplementary mathematics factor is also unrelated to any other test.
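
The claim that the supplementary mathematics test stands almost alone on Factor IV can be checked from its row in Table 3, since a communality is the sum of a variable's squared loadings. The short sketch below assumes the row for Variable 20 is read correctly from the printed table, with decimal points restored; the variable names are illustrative.

```python
# Table 3 loadings for Variable 20 (Supplementary Experimental Math),
# Factors I through VII, with decimal points restored.
loadings_var20 = [0.19, -0.09, 0.09, 0.95, -0.08, -0.03, -0.08]

# Communality: sum of squared loadings across the seven factors.
communality = sum(loading ** 2 for loading in loadings_var20)

# Proportion of that common variance carried by Factor IV alone.
share_factor_iv = loadings_var20[3] ** 2 / communality

print(f"h2 = {communality:.2f}")
print(f"share from Factor IV = {share_factor_iv:.2f}")
```

The computed communality rounds to the .97 printed in the last column of Table 3, and over nine-tenths of it comes from Factor IV.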

The supplementary tests, with the exception of the mathematics 
test, appear to measure mainly general reasoning ability as repre- 
sented by Factor I. While ability to analyze and deduce as repre- 
sented by Factor I is an objective in the curriculum, the supple- 
mentary English, social studies and science tests seem to have no 
discernable relationship to any specific subject area. However, this 
was also the case for the corresponding MAT and SAT tests, all of 
which loaded primarily on general factors. Thus, while such tests 
undoubtedly measure subject area competence in some respect, suc- 
cess on them seems to be substantially related to general ability. 

All of the above observations may be summarized as follows: Success
on the MAT, SAT, and supplementary achievement tests was
strongly related to general ability. However, factors associated with
general verbal ability were more important for the MAT. A factor
associated with analytic or deductive reasoning ability was more
prevalent for the SAT and supplementary tests than for the MAT.
This tendency to account for success in terms of general ability was
reversed in the case of spelling and mathematics. A single spelling
factor accounts for success on both the MAT and SAT spelling tests, 
but three separate mathematics factors were identified. Those for 
the SAT and supplementary mathematics tests were quite distinct. 
The MAT mathematics factor, however, which also contributed to 
success on the MAT science test, is thus interpreted as a more gen- 
eralized, nonanalytic mathematics factor. 

In general then, it may be stated that the objective of this program,
to select and prepare an appropriate achievement battery for
the curriculum of the school, was achieved. However, the lack of
differentiation between the factors responsible for success in the
areas of language, social studies, and science leaves unanswered
questions about the validity of these tests.
UNDERGRADUATE ABILITY FACTORS
IN RELATIONSHIP TO
GRADUATE PERFORMANCE¹


ALBERT MEHRABIAN 
University of California, Los Angeles 


The present study investigated the relationships among a series of
criteria which can be employed in considering admission of stu- 
dents in graduate psychology programs. Further, the validity of 
these criteria for predicting graduate level performance was as- 
sessed. The following set of variables subsumes most of the criteria 
which are employed in considering candidates for graduate work in 
psychology: the Graduate Record Examination (GRE) Aptitude 
Tests and the GRE Advanced Test in Psychology; the Miller 
Analogies Test (MAT); undergraduate grade point average; grade 
point average in the junior and senior years of undergraduate work; 
increase in the grade point average in the last two years over the 
first two years of undergraduate work; number of mathematics and 
logic courses taken; rating of the department which the student 
attended as an undergraduate; sex of the student; amount of research
experience of the student as an undergraduate; an average
assessment of the overall promise of the student as a graduate stu- 
dent; and a rating of the student’s research in contrast to service 
orientation—the latter three being assessed from letters of recom- 
mendation and/or biographical statements of the student. 

There is research evidence relating scores on the MAT and the 
GRE, as well as undergraduate grade point average, to graduate 
school performance. The evidence relating to the MAT obtained in 
studies by Cureton, Cureton and Bishop (1949), Kelley and Fiske 

¹ … for their assistance in running the study, and the psychology staff at UCLA
for their cooperation in providing the graduate student performance ratings. 




(1951), Fahey (1953), Jensen (1953), Walters and Paterson (1959), 
and Platz, McClintock, and Katz (1959) provides support for the 
predictive validity of the MAT for graduate school performance in 
psychology. However, in one study conducted by Hymon (1957) 
the MAT did not significantly relate to either undergraduate or 
graduate grade point average or grades in research-oriented gradu- 
ate courses. 

In one study employing the GRE tests, Capps and DeCosta 
(1957) found that of four variables—undergraduate grade point 
average, the two Aptitude scores from the GRE and the GRE Ad- 
vanced Test scores—the GRE Advanced scores were the best pre- 
dictors of graduate core course grades. 

Undergraduate grade point average has been found to relate to 
graduate grade point average and research potential. In an early 
study, Cureton, Cureton, and Bishop (1949) reported a .77 correla- 
tion between undergraduate grade point average and professors’ 
ratings of graduate students’ overall potential and promise in psychology.
Capps and DeCosta (1957) found a 0.42 correlation between
undergraduate GPA and graduate GPA. Also, Platz, McClintock,
and Katz stated that undergraduate GPA, as well as undergraduate
GPA in science and mathematics, correlates significantly
with graduate GPA. However, the latter, but not the former,
also correlates with professors’ ratings of the potential scientific
contribution of graduate students. Finally, Jensen (1963) noted that 
undergraduate GPA is not always a significant predictor of first 
year graduate scholastic achievement, and that an objective test 
such as the Iowa Mathematics Aptitude Test tends to be a better 
predictor of scholastic achievement in graduate school. 

In sum, the evidence suggests that undergraduate grades and
scores on content examinations, such as the GRE Advanced Test,
tests of mathematical ability and grades in mathematics courses
during undergraduate work, as well as tests of verbal ability such as
in the verbal portion of the GRE or the MAT, can all partially predict
the performance of graduate students in psychology. There does
not seem to be any evidence, however, relating letter of recommendation
ratings or research experience of undergraduates to their
graduate school performance.

The aims of the present study were two-fold: first, to characterize
some of the ability factors based on the criteria employed in considering
candidates for graduate school; and second, to obtain the relationships
between the various admission criteria and graduate
school performance in psychology. Thus, in the first part of the 
study, unselected applicants for the graduate psychology program at 
UCLA were all rated on each of the 13 admission criteria and these 
criteria were subsequently factor analyzed. In a second part of the 
study the admission criteria scores of psychology graduate students 
were all related to three indices of graduate psychology perform- 
ance. The latter consisted of an average evaluation of the research 
competence and promise as a research worker of the student by 
those professors who had had contact with the student in a research 
setting; average grades of the student in content courses during his 
first year of graduate work; and average grades in first year statis- 
tics courses in graduate school. 

Method—Subjects. There were 266 subjects in the sample of can- 
didates who had applied for admission to the graduate psychology 
program at UCLA. These cases were selected on the basis of availability
of ratings on all of the criteria noted below. The latter sample
will be referred to as the “candidate” sample and will be distinguished
from the following sample of graduate students. The graduate
student sample consisted of 79 students presently enrolled in
the psychology program at UCLA; their performance in graduate 
school was related to the admission criteria. 

Procedure. The following measures were obtained for each candi- 
date based on information available from the candidate’s admission 
file: scores on the Verbal and the Quantitative portions of the Ap- 
titude Test of the Graduate Record Examination (GRE); score on 
the Graduate Record Examination Advanced Test in Psychology; 
score on the Miller Analogies Test (MAT); overall undergraduate 
grade point average (GPA); GPA during junior and senior years; 
increase in GPA in the last two, relative to the first two, years of 
undergraduate work; the number of mathematics and logic courses 
taken during undergraduate work; the rating of the graduate school 
faculty of the psychology department which the candidate attended
as an undergraduate (Cartter, 1966, p. 56); research experience
of the candidate; overall promise of the candidate as a research
psychologist; and research versus service orientation of the
candidate. 


The above measures were obtained as follows. In the case of the
GRE and MAT scores, percentile scores were used. GPA scores
were recorded on a 5-point scale ranging from zero to four. The
rating of the school, taken from Cartter (1966), was translated into 
a numerical scale as follows: “distinguished” departments were 
assigned a score of 4; “strong” departments were assigned a score of 
3; “good” departments were assigned a score of 2; and “adequate” 
departments were assigned a score of 1. For departments not in- 
cluded in the ratings provided by Cartter (1966), the rating of 
adequate, that is, a score of 1 was assigned. 
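
The department coding just described amounts to a lookup with a default. A minimal sketch follows; the dictionary and function names are hypothetical, and only the four labels, their scores, and the default of 1 for unrated departments come from the text.

```python
# Numeric scores assigned to the four rating labels taken from Cartter (1966).
CARTTER_SCORES = {"distinguished": 4, "strong": 3, "good": 2, "adequate": 1}


def department_score(rating=None):
    """Return the numeric score; unrated departments count as 'adequate'."""
    if rating is None:
        return CARTTER_SCORES["adequate"]
    return CARTTER_SCORES[rating.lower()]


print(department_score("Strong"))   # a rated department
print(department_score())           # a department absent from the ratings
```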

The research experience of a candidate was assessed on a 4-point 
scale including 0, for no research experience; 1, corresponding to 
one laboratory course; 2, corresponding to one year of supervised 
research experience; and 3, corresponding to two or more years of 
supervised research experience. 

The overall promise of a candidate was rated on a 10-point scale 
which ranged from 9: promises to graduate from graduate school 
with distinction and become a major research contributor in his 
area; 8: promises to graduate from graduate school with distinction 
and become a distinguished and active contributor in his area; 7: 
promises to graduate from graduate school with distinction and be- 
come a moderately active contributor in his area; 6: promises to 
graduate from graduate school with distinction and become a minor 
contributor in his area; 5: promises to graduate from graduate
school with distinction but with uncertain potential as a research
worker; 4: promises to graduate from graduate school with uncertain
potential as a research worker; 3: promises to graduate from
graduate school with no potential as a research worker; 2: will most
probably get a doctorate; 1: will probably get a doctorate; and 0:
is unlikely to get a doctorate. In addition to the candidate's research
potential, his research orientation was assessed from his biographical
statement as well as his letters of recommendation, with
scores ranging from —3 corresponding to extreme service orientation 
to +3 corresponding to extreme research orientation. 

For each candidate two additional indices were available from the 
action of the admissions committee who studied the folder of the 
candidate. The committee rated each candidate on a 6-point scale 
on the basis of an intuitive judgment of the information available 
for that candidate. A committee “evaluation” score was the mean
rating based on independent judgments of at least two committee
members. The scale used was as follows. 0: not acceptable; 1: would
barely make it through our program; 2: would probably get through 
our program but why knock ourselves out; 3: acceptable in terms 
of our present students; 4: desirable, would add to the department; 
5: outstanding, accept immediately. A second index was the actual 
acceptance and non-acceptance letters mailed to the candidates. 
These were rated in the present study as follows. 0: student received 
a letter indicating that he was not accepted; 1: not accepted, placed 
on a waiting list but subsequently rejected; 2: not accepted, placed 
on a waiting list and subsequently accepted; 3: accepted; 4: ac- 
cepted and given some indication of possible financial support; 5: 
accepted and offered possible financial aid, but the student decided 
to go elsewhere; 6: accepted and given definite promise of financial 
assistance; and 7: accepted and offered definite promise of financial 
assistance, but student chose to go to another school. 

In the second section of the study, our present graduate students 
were used as subjects. Their graduate school performance was re- 
lated to the preceding criteria, and the following set of additional 
measures were also obtained: (1) an average rating of the student 
on the same 10-point scale reported above for assessing the overall 
research promise of a student—this scale was completed by all of 
those members of the faculty who had had extensive research con- 
tact with the student; (2) the average numerical grade of a student 
in all the content courses which he had taken in his first year; and 
finally (3) an average grade for all of the statistics courses which 
the student had taken his first year. 

Results. The 13 variables on which each subject in the sample was
assessed were factor analyzed and a principal component solution
was obtained. There were 6 factors with eigenvalues greater than
unity, and these factors accounted for 75 per cent of the total vari- 
ance. Varimax rotation of the first six factors yielded the following 
groupings of the criteria. For each factor reported below the criteria 
are listed in terms of decreasing loadings on the particular factor. 

GRE and MAT factor: this factor was defined solely by the 
GRE Verbal, the MAT, the GRE Advanced, and the GRE Quanti- 
tative scores. 

Grade point average factor: this was defined by the overall
undergraduate GPA and GPA in the last two years of undergraduate
work.




Research orientation factor: this was defined by the extent of re- 
search experience, a research in contrast to a service orientation, 
and finally by the letter of recommendation rating of the candidate 
regarding his potential as a graduate student and research worker in 
the field of psychology. 

Grade point average improvement factor: this was defined by the 
increase in the GPA of the student in the last two years over the
first two years of undergraduate work. 

Sex factor: this was defined solely by the sex of the candidate. 

Mathematical training factor: this was defined by the total number
of mathematics and logic courses which the student had taken
and by a relatively low rating of the psychology program which the 
student attended as an undergraduate. 
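
The retention rule reported above (keep components whose eigenvalues exceed unity, then report the share of total variance they explain) can be sketched as follows. The eigenvalues are hypothetical, chosen only so that six exceed 1 and together account for 75 per cent; the article does not report the actual values. For a correlation matrix the total variance equals the number of variables, here 13.

```python
# Hypothetical eigenvalues for 13 standardized variables (they sum to 13).
eigenvalues = [3.00, 1.80, 1.50, 1.25, 1.15, 1.05,
               0.85, 0.70, 0.60, 0.45, 0.30, 0.20, 0.15]

# Retain components with eigenvalues greater than unity.
retained = [value for value in eigenvalues if value > 1.0]

# Total variance of a correlation matrix equals the number of variables.
explained = sum(retained) / len(eigenvalues)

print(len(retained), f"{explained:.0%}")
```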

An additional factor analysis was performed in which the evaluation
rating of the admissions committee and the committee's acceptance
decision were also included. The latter factor analysis
yielded factors identical to the ones already reported. However, it
was found that the ratings of the admissions committee as well as
the acceptance decisions loaded on the first factor. In other words,
this second factor analysis indicated that the scores on the GRE
and the MAT mainly determined the rating of a student’s promise
as a graduate student in psychology by the admissions committee
and their subsequent acceptance decisions.

In a second set of analyses, the correlations of the admissions cri- 
teria with the three criteria of graduate school performance of each 
of the 79 graduate students were obtained. In reporting these corre- 
lations, the criteria of student sex, increase in the GPA of the last 
over the first two years of undergraduate work, the rating of the
department which the student attended as an undergraduate, and his
research experience are excluded, since these did not relate significantly
to any of the three indices of graduate school performance.
Table 1 contains the intercorrelations among the remaining criteria
and correlations of these with the graduate school performance
measures. With df = 77, correlation coefficients exceeding .222 are
significant at the .05 level.
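
The significance threshold for a correlation follows from the t distribution through r = t / sqrt(t² + df). The sketch below assumes a two-tailed test and takes the critical t for 77 degrees of freedom (about 1.991) from a standard table rather than computing it; the function name is illustrative.

```python
import math


def critical_r(t_critical, df):
    """Smallest |r| reaching significance, given the critical t for df."""
    return t_critical / math.sqrt(t_critical ** 2 + df)


# Assumed two-tailed .05 critical value of t for df = 77, read from a t table.
T_CRITICAL_DF77 = 1.991

r_crit = critical_r(T_CRITICAL_DF77, 77)
print(round(r_crit, 3))
```

Under these assumptions the threshold comes out at about .22, in line with the value quoted in the text.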

Although some of the variables included in Table 1 do not signifi- 
cantly correlate with graduate school performance criteria, their 
inclusion in Table 1 provides some information about the differential
effectiveness of the criteria in predicting graduate school performance.
For example, the last two years’ GPA exhibits a stronger
relationship to graduate school performance than overall GPA.
Again, with the exception of the GRE Verbal index, all other GRE
and MAT indices significantly relate to graduate school performance,
with the GRE Advanced scores relating most consistently and
strongly.

TABLE 1

Intercorrelation Matrix Relating Admission Criteria to Graduate Performance Criteria*

Variables 2 Sa. 7 8 9 10 3101 70213
1. GRE Verbal Al 44 .70 19 .18 -.06 .17 00 .42 .23 -.14 .17
23
2. GRE Quantitative 44 43 .16 .13
3. GRE Advanced 53 48 18 Jl 34 13 45 31 48 53
5. Overall GPA 45
6. Last two years’ GPA
7. Number of math and logic "5 20. .08 12 0. AB
9. Letter B (res. orient.) 01 .13 A9 3l
10. Evaluation no. 43 .10 22
11. Acceptance no. 30 .21
12. Prof. evaluation of grad. achievement
13. Grad. content grades
14. Grad. statistics grades

* r > .22, p < .05.

One final analysis of the latter data was carried out as follows. In 
an attempt to compute a regression equation which can be used to 
estimate the overall quality of a student's graduate school and re- 
search performance promise, a stepwise regression analysis was 
performed on the following variables. The dependent variable was 
an overall index of graduate school performance which was the sum 
of the z-scores of a given subject’s three indices of graduate school 
performance. Also, a single index for the GRE and MAT tests was 
computed for each student by simply summing his percentile scores 
on all four of the tests, henceforth referred to as the GRE-MAT index.
The last two years’ GPA rather than the overall GPA was employed,
since it exhibits a stronger relationship with the performance
criteria. The remaining criteria were left unchanged and simply
entered into the regression analysis. The results of the regression 
analysis for the first five indices entered into the equation were as 
follows: 


Graduate school performance = 134 (GRE-MAT index) + 105.7
(letter of recommendation rating) -- 22.5 (research orientation
rating) + 18 (number of math and logic courses) + 91 (last two
years’ GPA).
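
The dependent variable in this regression, the sum of each student's z-scores on the three performance indices, can be sketched as below. The four students and their scores are hypothetical, the function names are illustrative, and the use of the population standard deviation is an assumption, since the article does not say which form was used.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize a list of scores to mean 0 and standard deviation 1."""
    center, spread = mean(values), pstdev(values)
    return [(value - center) / spread for value in values]


def composite_index(*measures):
    """Sum each student's z-scores across the supplied measures."""
    standardized = [z_scores(measure) for measure in measures]
    return [sum(student) for student in zip(*standardized)]


# Hypothetical scores for four students on the three performance indices.
research_rating = [6.0, 4.0, 8.0, 5.0]
content_grades = [3.2, 3.0, 3.9, 3.5]
statistics_grades = [3.0, 2.5, 4.0, 3.5]

performance = composite_index(research_rating, content_grades, statistics_grades)
print([round(value, 2) for value in performance])
```

Standardizing first gives the three indices equal weight in the composite regardless of their original scales.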

Table 2 includes statistics for the various independent measures 
which were successively entered into the regression analysis. For 
each independent effect, the mean, the standard deviation, the range 
of possible scores, and the correlation with the overall graduate 
school performance index are reported. Furthermore, in the last 
column of that table, the multiple correlation coefficient with the 
dependent variable is reported for that step in the regression analy- 
sis. Thus, for instance, the coefficient equalled .507 when both the 
GRE-MAT and the letter of recommendation ratings were used in 
the equation. The equation reported above includes only the first 
five independent variables due to diminishing returns with the addi- 
tion of the sixth and remaining variables. 

Discussion. The major thrust of the findings from the first sample 
of unselected candidates is that the Graduate Record Examination 
Aptitude and Advanced Tests, together with the Miller Analogies 
Test, define an ability factor which, incidentally, forms the basis for 
admissions judgments of the UCLA Psychology Faculty. Of the re- 
maining ability factors for the unselected candidate sample, the 


TABLE 2

Statistics Relating to Variables Employed in the Stepwise Regression Analysis

                                                            Correlation with
                                               Possible     Overall Graduate     Multiple
                                   Standard    Range of     School Performance   Correlation
Variable                   Mean    Deviation   Scores       Index                Coefficient

GRE-MAT index              317.6   64.6        0 to 396     .434                 …
Letter of Recommendation
  Rating                   4.9     0.6         0 to 9       .356                 .507
Research Orientation       2.0     2.1         -3 to +3     .172                 …
No. Math & Logic Courses   3.4     2.8         0 and up     .191                 …
Last Two Years’ GPA        …       …           …            .217                 .565
Department …
candidate sex factor and the “increase in the last two years’ undergraduate
GPA over the first two years’ GPA” exhibit the weakest
relationships with graduate performance criteria and therefore 
would not seem to be as relevant for the characterization of candi- 
dates. Thus, in addition to the GRE-MAT index, there are three 
other candidate ability factors, namely, a grade point average, a 
research orientation and a mathematical training factor. The latter 
factor is defined by the number of mathematics and logic courses 
taken during undergraduate work as well as by a relatively low 
rating of the school which the candidate attended as an undergradu- 
ate. These suggest that candidates who have attended less distin- 
guished psychology programs have had much more extensive course- 
work preparation at least in mathematics and logic than have appli- 
cants from the more distinguished departments, who are perhaps 
more representative of average students in undergraduate programs 
in those departments. 

Results from the second segment of the study help further clarify 
the four aforementioned ability factors, that is, the GRE-MAT index,
the GPA, research orientation, and mathematical training factors.
For example, the finding that the last two years’ GPA exhibits
a stronger relation to the overall graduate school performance index 
than the overall GPA, suggests that the former be used as a sum- 
mary index of GPA. The research orientation factor was defined in 
terms of letters of recommendation, research orientation, and re- 
search experience ratings. Of these three indices, the first two exhibit 
significant correlations with graduate performance and, according to 
the regression analysis, seem to be sufficiently different to merit 
separate treatment in attempts at the prediction of graduate school
performance. (A stepwise regression analysis is such that at each
successive stage the variable entered is the one which makes the 
greatest reduction in the error sum of squares. Thus, if there are two 
or three correlated variables, then one of these may be entered into 
the regression equation and serve essentially as a substitute for the
remaining ones. However, because of the fact that the regression 
analysis yielded the letter of recommendation and the research 
versus service orientation ratings as two successive items in the 
equation, these two variables would seem to warrant separate treat-
ment as indices of a student’s promise as a research worker.) Finally,
for the last factor, mathematical training, the number of mathe-
matics and logic courses seems to suffice as an index of a candidate’s
ability defined by the factor.

The above ability factors, which were obtained for an unselected
sample of candidates, surprisingly enough continue to exhibit con- 
sistent relationships with graduate school performance even after 
extensive selection from the candidate sample which tends to make 
the graduate student sample a much more homogeneous one. Thus, 
considering the homogeneous quality of the sample of graduate stu- 
dents, the obtained relationship between the GRE-MAT index and 
the overall graduate school performance is quite impressive and 
would indicate that this index defines an ability factor that has 
validity in predicting graduate level performance. The next strong- 
est predictor of graduate level performance is the letter of recom- 
mendation rating which may be more readily obtained from re- 
spondents by essentially presenting them with a scale such as the 
one employed in the present study. Respondents could also provide a 
service-research orientation rating for each candidate on a —3 to 
+3 scale such as the one used in this study. 

In sum, the findings of the present study are generally consistent 
with existing findings, but also provide information about additional 
criteria such as letter of recommendation ratings which are generally 
used in assessing candidate ability but whose validity has not been 
explored. Also, results of the regression analysis provide a basis for
differential weighting of the various criteria in attempts at assessing 
a candidate’s promise in graduate studies. 


PREDICTION OF ACADEMIC AND NONACADEMIC ACHIEVEMENT IN
TWO-YEAR COLLEGES FROM THE ACT ASSESSMENT


LEONARD L. BAIRD 
American College Testing Program 


THE growing importance and complexity of two-year colleges
emphasize the need for comprehensive information about two-
year college students. The purpose of the present study is to examine
the prediction of academic and nonacademic achievement in two-
year colleges. Specifically, the validity of the ACT assessment is
evaluated by predicting college grades and nonacademic achieve- 
ments for a large sample of two-year college students. The ACT 
assessment consists of the ACT tests and the Student Profile Sec- 
tion, a short biographical inventory providing information about 
students’ educational and vocational plans, campus needs, and non- 
academic achievements. The ACT tests have been validated many 
times (American College Testing Program, 1965, 1966) and their 
validity in two-year colleges has been established (Hoyt and Mun- 
day, 1966). The validity of the nonacademic scales has also been 
studied extensively (Richards, Holland, and Lutz, 1967b; Richards 
and Lutz, 1968). The present study is different from earlier studies 
in that it is concerned only with two-year colleges and in that it 
deals with the entire college career rather than just the first year. 

Method—Predictors. The predictive variables included the follow- 
ing measures: 

ACT tests. The ACT tests are tests of academic aptitude which 
yield subtest scores in English, mathematics, social studies, and 
natural science. Each score is converted to a common scale with a 
mean of approximately 20 and a standard deviation of about five for
college-bound high school seniors. The four subtest scores are av-
eraged to yield a Composite score (American College Testing Pro-
gram, 1965).

High school grades. As a regular part of the ACT procedure, per-
sons taking the ACT battery report the grades they have received
in high school courses in four areas: English, mathematics, social
studies, and natural science. Research by Davidsen (1963) and by
Hoyt (1963) indicates that such self-reported grades correspond
closely to high school transcripts and predict college grades as well
as grades reported by the high schools. The measure used in the
present study is the overall average on a four-point scale (A = 4,
B = 3, etc.) of all grades reported.

Nonacademic achievement scales. A checklist of extracurricular 
accomplishment was developed to yield scores in the following areas: 
leadership, music, drama and speech, art, writing, and science. Each 
scale consisted of eight items ranging from common and less impor- 
tant accomplishments to rare and more important ones. For exam- 
ple, science items included such accomplishments as “performed an 
independent scientific experiment” and “won a prize or award of any 
kind for scientific work or study.” In general, the accomplishments 
involve public action or recognition, so that in principle the accom-
plishments could be verified. The score on each scale is simply the
number of accomplishments the student marks “Yes, applies to me.”
Students with high scores on one or more of these simple scales pre- 
sumably have attained a high level of accomplishment which re- 
quires complex skills, long-term persistence, or originality. These 
scales are discussed in detail elsewhere (American College Testing 
Program, 1965; Holland and Richards, 1967). 

Criteria of achievement. The criterion variables included the fol-
lowing measures:

College grades. Each student reported his grade average for his
last college term by checking one of the following alternatives: D or
lower, D+, C, C+, B, B+, A or A+. Scores from one to seven were
assigned to these alternatives so that a high score indicates high
grades. In an earlier study (Richards and Lutz, 1968) these self-
reported grades correlated .84 to .87 with college-reported GPA. In
addition, the colleges in this sample were asked to report the grade
average for each student on a standard four-point scale where A =
4.00, B = 3.00, C = 2.00, D = 1.00, and a failing grade = 0.


Nonclassroom achievement record. We used a checklist of nonaca-
demic accomplishments to measure achievement in the following
areas: leadership, social participation, art, social service, science,
business, humanities, music, writing, social science, and speech and
drama. We also developed a simple scale to determine public recog-
nition for academic attainment in college. These scales are very sim-
ilar to the high school achievement scales, except that the items refer
to the college years. A detailed account of the rationale, develop- 
ment, and statistical characteristics of these scales is presented else- 
where (Richards et al., 1967a, 1967b). 

Each scale includes ten items, except the Recognition for Aca-
demic Accomplishment Scale, which has five items. Items range
from common and less important accomplishments to rare and 
more important ones. For example, music accomplishments included 
these: “composed or arranged music which was publicly performed," 
“publicly performed on two or more musical instruments,” “attained 
a first division rating in a state or regional solo music contest.” The 
remaining scales consisted of similar items with content appropriate 
to the various areas of achievement. 

Sample. The sample consisted of 1628 men and 1079 women stu- 
dents completing their second year at 27 two-year colleges. These 
students, who had taken The American College Testing Program 
(ACT) Battery of college admission tests before entering college, 
were followed up in the second semester of their second year as part 
of a comprehensive evaluation of the validity of the ACT assess- 
ment (Baird, Holland, Richards, and Shevel, 1969). Comparisons of 
students who completed the follow-up questionnaire with those still 
attending the same college but not completing the follow-up indi-
cated that the differences were small and unlikely to bias the overall
results. The environmental characteristics of the colleges were stud-
ied, using the scores computed by Richards, Rand, and Rand (1960).
This study suggested that the averages of the sample colleges were 
reasonably close to the averages of all two-year colleges and also 
reflected the diversity in such colleges (Baird, Holland, Richards, 
and Shevel, 1969). Thus, the sample of institutions and students seemed
reasonably representative of American two-year colleges. 

Infrequency scales. Two “infrequency” scales were developed to 
check the student’s tendency to exaggerate his achievements, one 
for high school achievements and one for college achievements (Hol- 
land and Richards, 1966; Richards and Lutz, 1968). These scales 
consist of the items on each achievement scale claimed least fre-
quently. Students who claim all these rare achievements may be
exaggerating their achievements. Therefore, students who scored 
more than four on either or both Infrequency scales were eliminated 
from the analyses. A total of 18 men and nine women were elimi- 
nated by this procedure. 
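
The scoring and screening rules described above can be sketched in a few lines of Python. This is only an illustration: the function names, the data layout (a list of True/False responses per scale), and the sample values are assumptions and are not taken from the original analysis.

```python
def scale_score(responses):
    """Score a checklist scale as the number of items marked 'Yes'."""
    return sum(1 for marked in responses if marked)


def keep_student(hs_infrequency, college_infrequency, cutoff=4):
    """Keep a student only if neither Infrequency score exceeds the cutoff.

    The cutoff of four follows the text; the argument names are hypothetical.
    """
    return hs_infrequency <= cutoff and college_infrequency <= cutoff


# Hypothetical responses to an eight-item high school science scale.
science_items = [True, True, False, False, True, False, False, False]
science_score = scale_score(science_items)

# Hypothetical (high school, college) Infrequency scores for three students.
students = {"a": (0, 1), "b": (5, 0), "c": (4, 4)}
retained = [name for name, (hs, col) in students.items() if keep_student(hs, col)]
```

Under these assumed values, student "b" is dropped because one Infrequency score is above four, while a score of exactly four is still retained.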

Results. We computed the means and standard deviations of low 
infrequency men and women on the college criterion measures. We 
then computed the intercorrelations between predictive and criterion 
measures. Since not all students completed the questionnaire, we 
used a missing data computer program. Then, using a stepwise mul- 
tiple regression program, we determined the best subset of predictors 
for each criterion measure. Because the N is large, many variables 
which had no practical effect on the size of the multiple correlation 
would produce a statistically significant reduction of variance. 
Therefore, we retained only those variables which increased the 
multiple correlation by at least .01. In earlier research with large
samples, this criterion has been more stringent than using a .01 sig- 
nificance level for reduction in residual variance. 
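
The retention rule just described (keep a predictor only if it raises the multiple correlation by at least .01) can be illustrated with a small forward-selection sketch. The original stepwise program is not described in detail, so everything below is an assumption made for illustration: the function names, the use of a correlation matrix as input, and the three hypothetical predictors with invented correlations.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def multiple_r(rxx, rxy, subset):
    """Multiple correlation of the criterion with the predictors in subset."""
    if not subset:
        return 0.0
    sub = [[rxx[i][j] for j in subset] for i in subset]
    rhs = [rxy[i] for i in subset]
    betas = solve(sub, rhs)  # standardized regression (beta) weights
    r_squared = sum(b * r for b, r in zip(betas, rhs))
    return math.sqrt(max(r_squared, 0.0))


def forward_stepwise(rxx, rxy, min_gain=0.01):
    """Add the predictor giving the largest R until the gain falls below min_gain."""
    chosen, current, steps = [], 0.0, []
    remaining = list(range(len(rxy)))
    while remaining:
        best = max(remaining, key=lambda j: multiple_r(rxx, rxy, chosen + [j]))
        new_r = multiple_r(rxx, rxy, chosen + [best])
        if new_r - current < min_gain:
            break
        chosen.append(best)
        remaining.remove(best)
        current = new_r
        steps.append((best, new_r))
    return steps


# Hypothetical correlations among three predictors (rxx) and of each
# predictor with the criterion (rxy); these are not the study's values.
rxx = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.9], [0.5, 0.9, 1.0]]
rxy = [0.45, 0.35, 0.30]
steps = forward_stepwise(rxx, rxy)
```

With these invented correlations the third predictor is highly correlated with the second, so it would raise R by less than .01 and is left out, which is the behavior the retention rule is meant to produce.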

The means and standard deviations of the college achievement
scales for men and women are shown in Table 1. In comparing col-
lege and student GPA’s, it must be remembered that colleges used a
4-point scale while students used a seven-category item. Therefore, 
both means are between C and C+. Achievement in most nonaca- 
demic areas is fairly rare. The standard deviations are larger than 
the means because the nonacademic achievement scales are highly 
skewed. (Studies by Holland and Richards, 1965, show this skew-
ness to have little effect on the pattern of results.)

The intercorrelations between predictors and criteria are shown in
Table 2. Results for males are above the diagonal, females below.
In general, the correlations among measures of academic potential
and performance are moderate, and the correlations among non-
classroom achievements in the same or closely related areas are also
moderate. There is little relationship between nonclassroom achieve-
ment in areas which are not closely related and between nonclass-
room achievements and measures of academic potential and per-
formance. These relationships are consistent with what previous in-
——

* This program was developed by Douglas Whitney of The American College
Testing Program.
TABLE 1
Means and Standard Deviations for College Achievement Scales

                                           Men               Women
                                       Mean     SD       Mean     SD

College Rept GPA                       2.36    .56       2.02    .60
Student Rept GPA                       3.96   1.11       4.41   1.23
Recognition for Academic Accomp.        .22    .58        .36    .70
Lead Ach                                .69   1.56       1.15   1.84
Social Participation                    .85   1.44        .82   1.20
Art Ach                                 .52   1.16        .85   1.52
Soc Serv Ach                            .61   1.18       1.07   1.34
Science Ach                             .25    .66        .06    .81
Business Ach                            .82   1.03  [illegible]  .07
Humanistic Ach                          .94   1.28       1.27   1.34
Music Ach                               .22    .81        .26    .17
Writing Ach                             .32    .85        .58   1.07
Social Science Ach                      .36    .71        .82    .60
Drama Ach                               .90    .83        .89   1.03

Note. Colleges reported GPA's on a standard four-point scale, while students used a seven-
alternative item. Therefore both means are between C and C+.
vestigators have found (Holland, 1961; Hoyt, 1966; Nichols and 
Holland, 1963; Holland and Richards, 1965; Richards et al., 1967b; 
Richards and Lutz, 1968). The high correlations between college- 
reported and student-reported college GPA suggest that most stu- 
dents give a frank account of their accomplishments. 

The results of the multiple correlation analyses for the academic
criteria (college GPA and Recognition for Academic Accomplish-
ment) are shown in Table 3. For all measures of academic accom-
plishment the best predictor is high school grades. The ACT test
scores add slightly to the prediction. This finding is consistent with a
large number of previous investigations of the prediction of aca-
demic performance (American College Testing Program, 1965).

The results of the multiple regression analyses for college achieve- 
ments in areas which reflected accomplishments similar to those 
assessed in high school are shown in Table 4.

With one exception, the best predictor of achievement in college is 
similar achievement in high school. In most cases the prediction of 
nonacademic accomplishment is improved only slightly by adding
variables to the corresponding high school scale. These findings are
consistent with a substantial literature showing that past perform- 
ance in specific areas predicts future performance in those areas. 
The information in Table 4 also confirms earlier findings that 
TABLE 3
Beta Weights and Multiple Correlations for Predicting
Academic Accomplishment

Criterion                     Predictors          Beta           R

                                   Men

College GPA                   HS GPA              .972          .45
                              ACT Soc Studies     .140          .49
                              ACT English         [illegible]   .50

Recognition for Academic
  Accomplishment              HS GPA              .259          .30
                              ACT English         [illegible]   .82

                                  Women

College GPA                   HS GPA              .504          .59
                              ACT Nat Sci         .123          .61
                              ACT English         .122          .02

Recognition for Academic
  Accomplishment              HS GPA              .290          .38
                              ACT English         .145          [illegible]
                              ACT Math            .094          .42

TABLE 4 


Beta Weights and Multiple Correlations for Predicting Criteria of Nonacademic
Accomplishment Highly Comparable to the High School Achievement Scales 


Men Women 
Criterion Predictors Beta R Predictors Beta R 
College Lead Ach HS Drama Ach .135 .22 HS Lead Ach .156 .25 
HS Writing Ach .125 2 HB GPA K aM 40 
ad Ach i5 . usie Ac! 1 X 
ga EIU HS Writing Ach .114 .35 
College Music Ach HS Music Ach ^ .872 .38 HS Music Ach — .387 .41 
HS Writing Ach .005 .39 HS Drama Ach .089 .42 
CollegeSp DrAch HS Drama Ach  .218 .27 HS Drama Ach .289 .32 
HS Writing Ach  .158 .31 HS Writing Ach —.086 .33 
College Art Ach HS Art Ach .420 .44 HS Art Ach 610. .61 
HS Science Ach .123..46 
College Writ Ach ^ HS Writing Ach  .940 .95 HS Writing Ach .383 40 
ACT Soc Studies .096 -36 ACT English .095 .41 
College Sci Ach HS Science Ach  .301. .82 HS Science Ach  .128 .13 
i 7106 .34 ACT Soc Studies .086 .16 
ACT Nat Sei 1 AENA ET 


lo EB cee dod 
academic potential and success contribute little or nothing to the
prediction of non-classroom success (Astin, 1962; MacKinnon,
1960; Torrance, 1963; Price, Taylor, Richards, and Jacobsen, 1964).

The criteria which are not highly comparable to the high school 
achievement scales include the following: social participation, social 
service, business, humanistic-cultural, and social science achieve- 
ment. Table 5 shows the best predictors of these criteria of college 
achievement. 

The multiple correlations in Table 5 are generally somewhat 
lower than the correlations in Table 4. (Social Participation and 
Humanistic-Cultural achievement are predicted better than other 
areas.) If we had constructed high school achievement scales cor- 
responding closely to the college achievement scales, the level of 
prediction would probably have been higher. For the most part, the 
correlations in Table 5 support the conclusion that academic pre- 
dictors have little relation to a student's potential for non-classroom 
accomplishment. 

Discussion. The present study indicates that it is possible to pre- 


TABLE 5
Beta Weights and Multiple Correlations for Predicting Criteria of Nonacademic
Accomplishment Not Highly Comparable to the High School Achievement Scales
(High Infrequency Students Eliminated)
Men Women R 
Criterion Predictors Beta R Predictors Beta 


College Social Par HS Writ Ach .141 .25 HS Lead Ach -176 d 
HS Lead Ach .127 .30 HS Writ Ach -157 EA 
HS Science Ach .125 .32 HS Drama Ach .13 . 
HS Drama Ach  .105 .34 


College Soc Ser Ach HS Writ Ach ^ .137 .20 HSleadAch -H0 aa 
HS Lead Ach ‘114 .23 HS Drama Ach 17: 


HS Science Ach .095 .25 HS Art Ach Jon " 
College Bus Ach HS Science Ach .138 .17 HS Drama Ach O77 + 
HS Writing Ach .084 .20 HS Art Ach “069 .16 

HS Lead Ach  .064 .21 HS Lead Ach = «009 | 

College Hum HS Writ Ach 199.27 HS Writ Ach EC j^ | 
Cult Ach ACT Soc Studies .148 .31 ACT Soc Studies «175 sg 
HS Drama Ach  .104 .33 HS Art Ach “130 38 


HS Science Ach .092 .34 HS Drama Ach 
122 -11 


College Soc Sci Ach HS Writ Ach .166 .22 HS Writ Ach an 
HS Science Ach .105 .25 HS Drama Ach “066 2 | 


HS Lead Ach .082 .26 HS Art Ach 
dict academic and nonacademic achievement in two-year colleges
with moderate success. Although some areas of nonacademic 
achievement were not predicted very successfully, most correlations 
were high enough to provide useful information about potential for 
college achievement. Of course, usefulness depends on the decisions 
made on the basis of the information. Baird and Richards (1968) 
showed that the high school nonacademic achievement scales can be 
used to admit students who will be especially likely to be college 
nonacademic achievers. However, any selection procedure has its 
costs as well as its gains. 

In summary, these results suggest that academic and nonaca-
demic achievement at two-year colleges can be predicted to a useful
degree by the information provided in the ACT assessment. 


THE PREDICTION OF CREATIVE ACHIEVEMENT FROM
A BIOGRAPHICAL INVENTORY¹


CHARLES E. SCHAEFER 
Fordham University 


IN a recent review article, Freeberg (1967) summarized the 
literature dealing with the use of biographical information to pre- 
dict student achievement. Since biographical questionnaires seem 
to measure a wide variety of factors—both intellectual and non- 
intellectual, Freeberg concluded that they perform at their best 
and surpass other predictors when complex criteria are used, such 
as creative achievement. Although there have been numerous stud- 
ies concerned with the relationship of biographical data and aca- 
demic success in terms of grades (Asher and Grey, 1940; Myers, 
1952; Schaefer, 1963), relatively few biographical studies (Anastasi 
and Schaefer, 1969; Schaefer and Anastasi, 1968) have focussed 
on the prediction of student creative achievement. 

The present study was designed to investigate the validity of a 
biographical inventory in combination with other instruments in 
the prediction of creative achievement at the high school level. 
The instruments selected for comparison purposes consisted of sev- 
eral of the more commonly used perceptual, personality, and 
achievement measures of creativity. 

Method. Subjects. The subjects were 800 male and female stu-
dents from 10 high schools in the New York metropolitan area.²


¹ The present study is part of a larger project supported initially by Grant
No. MH 10233-01 from the National Institute of Mental Health and subse-
quently by Subcontract No. 2 of the Center for Urban Education's Contract
OEC-1-6-062868-2039 with the United States Office of Education.

² More detailed information on the subjects, procedures, and results can be
found in Schaefer and Anastasi (1968), and Anastasi and Schaefer (1969).
Participating high schools were Abraham Lincoln, Art and Design, Bronxville,
Erasmus Hall, Forest Hills, Jamaica, Midwood, Music and Art, Regis, and
Stuyvesant.
The schools were chosen because they offer courses or programs
providing opportunities for creative achievement. The subjects 
included 434 seniors, 272 juniors, and 94 sophomores. The group 
as a whole was superior in terms of academic achievement and 
intelligence test scores. 

Creativity criteria. The subjects were assigned to eight criterion 
groups of 100 students each according to the following three-fold 
system of classification: achievement (creative or control), field of 
achievement (artistic or scientific), and sex. For the boys the 
artistic field included both art and writing, while the scientific 
field encompassed the natural sciences and mathematics. Since rel- 
atively few girls were found to show scientific achievement, the 
girls’ specialty fields were classified as either art or writing. The 
eight criterion groups were designated: Creative-Artistic-Boys
(CrA-B), Control-Artistic-Boys (CoA-B), Creative-Scientific-Boys
(CrS-B), Control-Scientific-Boys (CoS-B), Creative-Art-Girls 
(CrA-G), Control-Art-Girls (CoA-G), Creative-Writing-Girls 
(CrW-G), and Control-Writing-Girls (CoW-G). 

The criterion employed for assessing creative achievement was 
a combination of teachers’ evaluations and creativity test scores. 
Thus, all the creative subjects were nominated by their teachers 
on the basis of one or more creative products described on a nomi- 
nation form. They also scored above a cutoff point on Guilford’s 
Alternate Uses and Consequences tests. The control subjects were 
nominated by the same teachers as having shown no evidence of 
creativity, and they scored below a cutoff point on the two Guilford 
screening tests. Within each field, the creative and control subjects
were matched in school attended, grade level, classes in which
enrolled, and grade-point-average.

Instruments. A comprehensive 165-item Biographical Inventory 
(BI) was constructed for this study. The questions were formu- 
lated primarily on the basis of previous research findings and 
hypotheses regarding the environmental correlates of creativity. 
The 165 items were grouped into the following five sections: physi-
cal characteristics, family history, educational history, leisure-time
activities, and miscellaneous. While most of the questions dealt
with objective facts regarding present or past activities and ex-
periences, some called for expressions of preference and others
concerned anticipated plans and goals. Separate BI scoring keys
were developed for each of the four specialty fields, i.e., Boys-
Artistic, Boys-Scientific, Girls-Art, and Girls-Writing. Further de- 
tails on the content and scoring of the BI are found in Schaefer 
and Anastasi (1968). In addition to the BI, the subjects were 
given the Barron-Welsh Art Scale (BW), the Gough Adjective
Check List (ACL), and the Franck Drawing Completion Test 
(FDCT). 

Procedure. The four predictive instruments were administered 
during a two-hour testing session held outside of school hours.
The subjects were paid for participation in this testing session.
Identification numbers were employed to provide anonymity, and 
the students were assured of the confidentiality of the data. 

For the BI data analysis, each criterion group was subdivided 
into two groups of 50, used for development of scoring keys and 
cross-validation, respectively. Separate BI scoring keys were pre- 
pared for each of the four fields of achievement. The scoring keys 
included only those items which differentiated between creative
and control groups at the p < .20 or better significance level. 
In addition to the BI score, each subject received scores for the BW, 
five scales of the FDCT (Abstraction, Asymmetry, Elaboration, 
Flexibility, and Originality), and the 24 currently available ACL 
scales. Hypotheses were advanced for 21 of the 31 variables. More 
specifically, it was hypothesized that the creative subjects would 
score significantly higher than the controls on the BI, BW, the five 
FDCT scales, and the following ACL scales: No. of Unfavorable 
Adjectives, Self-Confidence, Lability, Dominance, Exhibition, Au- 
tonomy, Aggression, and Change. It was further hypothesized that 
the controls would score higher on the ACL scales of Defensive-
ness, Self-Control, Order, Nurturance, Succorance, and Deference.
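
The key-development and cross-validation procedure described above can be sketched as follows. The text does not name the significance test used for item selection or the item weights, so the two-proportion z test, the unit weights of +1 and -1, the function names, and the item counts below are all assumptions made for illustration; the actual scoring details are given in Schaefer and Anastasi (1968).

```python
import math


def two_proportion_p(yes_a, n_a, yes_b, n_b):
    """Two-tailed p value for a difference between two proportions (z test)."""
    pooled = (yes_a + yes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (yes_a / n_a - yes_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


def build_key(creative_counts, control_counts, n=50, alpha=0.20):
    """Key the items that separate the development groups at p < alpha.

    An item is weighted +1 if endorsed more often by the creative group
    and -1 if endorsed more often by the control group (assumed weights).
    """
    key = {}
    for item, yes_cr in creative_counts.items():
        yes_co = control_counts[item]
        if two_proportion_p(yes_cr, n, yes_co, n) < alpha:
            key[item] = 1 if yes_cr > yes_co else -1
    return key


def bi_score(answers, key):
    """Sum the keyed weights of the items a subject endorsed."""
    return sum(weight for item, weight in key.items() if answers.get(item))


# Hypothetical endorsement counts in the two development groups of 50.
creative = {"q1": 35, "q2": 25, "q3": 10}
control = {"q1": 20, "q2": 26, "q3": 22}
key = build_key(creative, control)

# Scoring one hypothetical subject from a cross-validation group.
score = bi_score({"q1": True, "q2": True, "q3": False}, key)
```

Because the key is built on one half of each criterion group and applied to the other half, the validity estimate is not inflated by items that separated the development groups by chance.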

Results. The z ratios for the criterion group differences on each
of the 31 variables are contained in Table 1. Of the 21 scales for 
which directional hypotheses were advanced, 18 showed criterion 
group differences in the expected direction for all comparisons; 
17 of these 18 scales showed at least one criterion group difference 
significant at the .05 level or lower. Inspection of Table 1 also 
indicates that the BI was one of the most effective instruments 
in differentiating creative from control subjects across all four 
fields of achievement. A further comparison of the predictive valid- 
ity of the measures was conducted by means of a multiple re- 


TABLE 1
z Ratios of Mean Criterion Group Differences

                                         Criterion Group Comparison(a)
                                           (N = 100 in each group)

                                           CrA-B          CrS-B
                                            vs.            vs.
Test                                       CoA-B          CoS-B

Creative Groups Hypothesized Higher
(X1)  BI                                   8.54***        2.80**
(X2)  BW                                   9.97***        0.83
(X3)  FDCT-Abstraction                     0.26           0.70
(X4)  FDCT-Asymmetry                       5.50***        1.75*
(X5)  FDCT-Elaboration                     5.31***        2.51**
(X6)  FDCT-Flexibility                    -1.60           1.54
(X7)  FDCT-Originality                     4.23***        2.54**
(X8)  ACL-Aggression                       2.84**         3.14***
(X9)  ACL-Autonomy                         5.09***        4.48***
(X10) ACL-Change                           5.49***        2.03*
(X11) ACL-Dominance                        0.79           2.09*
(X12) ACL-Exhibition                       2.92**         3.12***
(X13) ACL-Lability                         5.24***        3.29***
(X14) ACL-No. Unfavorable Adj.             0.82           1.38
(X15) ACL-Self-confidence                  1.57           [illegible]

Control Groups Hypothesized Higher
(X16) ACL-Defensiveness
(X17) ACL-Deference
(X18) ACL-Nurturance
(X19) ACL-Order
(X20) ACL-Self-control
(X21) ACL-Succorance

No Specific Hypotheses
(X22) ACL-Abasement
(X23) ACL-Achievement
(X24) ACL-Affiliation
(X25) ACL-Counseling Readiness
(X26) ACL-Endurance
(X27) ACL-Heterosexuality
(X28) ACL-Intraception
(X29) ACL-No. Favorable Adj.
(X30) ACL-No. Checked
(X31) ACL-Personal Adjustment

[z ratios for X16-X31 and for the two girls' comparisons (CrA-G vs. CoA-G, CrW-G vs. CoW-G) illegible]

Note. One-tailed tests of significance were used for all criterion group comparisons except
for those ACL scores for which no hypotheses were advanced. N = 100 for each group except
for BI, in which data are reported for cross-validation groups only (N = 50).

(a) Creative-Artistic-Boys (CrA-B), Control-Artistic-Boys (CoA-B), Creative-Scientific-Boys
(CrS-B), Control-Scientific-Boys (CoS-B), Creative-Art-Girls (CrA-G), Control-Art-Girls
(CoA-G), Creative-Writing-Girls (CrW-G), Control-Writing-Girls (CoW-G).

* significant at .05 level.
** significant at .01 level.
*** significant at .001 level.

gression analysis. The following two rules were applied to the 31
scales to select the 17 variables for this analysis: (1) criterion
group differences in the expected direction across all four com-
parisons, and (2) at least one of these differences significant at
the .05 level or better. The Wherry-Doolittle method was used to 
determine the optimal combination of these 17 variables in pre- 
dicting the criterion status of the cross-validation groups. Table 2 
contains both the multiple and partial correlation coefficients of 
the selected variables in each field. For practical purposes, the 
Wherry-Doolittle analyses were terminated after the selection of 
three variables. 
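
The partial correlations reported alongside the multiple correlations express the relation of each selected variable to the criterion with the other selected variables held constant. As a hedged illustration of how such coefficients are obtained from zero-order correlations, the sketch below applies the standard recursive formula. The function name, the dictionary layout, and the correlations are hypothetical and are not taken from the study, and this is not the Wherry-Doolittle procedure itself.

```python
import math


def partial_r(r, y, x, controls):
    """Partial correlation of y and x with the variables in controls held constant.

    r maps a frozenset of two variable names to their zero-order correlation.
    """
    if not controls:
        return r[frozenset((y, x))]
    z, rest = controls[0], controls[1:]
    r_yx = partial_r(r, y, x, rest)
    r_yz = partial_r(r, y, z, rest)
    r_xz = partial_r(r, x, z, rest)
    return (r_yx - r_yz * r_xz) / math.sqrt((1 - r_yz ** 2) * (1 - r_xz ** 2))


# Hypothetical zero-order correlations among a criterion Y and three predictors.
pairs = {
    ("Y", "X1"): 0.5, ("Y", "X2"): 0.3, ("Y", "X3"): 0.2,
    ("X1", "X2"): 0.4, ("X1", "X3"): 0.1, ("X2", "X3"): 0.0,
}
r = {frozenset(pair): value for pair, value in pairs.items()}

first_order = partial_r(r, "Y", "X1", ["X2"])
second_order = partial_r(r, "Y", "X1", ["X2", "X3"])
```

The second-order value corresponds to the form of coefficient shown in Table 2, where each selected variable is related to Y with the two other selected variables partialled out.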

Discussion. The findings of this study highlight the general 
validity of biographical information in predicting creative achieve- 
ment. Of all the predictive variables in the present study, the BI 
proved to be the only instrument to be included in the four opti- 


TABLE 2
Multiple and Partial Correlation Coefficients for Combined Variables

Variable                         Multiple R      Partial r

Boys: Artistic Field (Art-Writing)
BI (X1)                             .64          [illegible]
ACL-Order (X19)                     .66          -.22
FDCT-Elaboration (X5)               .07          [illegible]

Boys: Scientific Field (Natural Sciences-Mathematics)
ACL-Autonomy (X9)                   .86           .38**
BI (X1)                             .46           .30

Girls: Art Field
FDCT-Elaboration (X5)               .48           .35**
BI (X1)                             .58           .29
[illegible]                         .55           .22**

Girls: Writing Field
BI (X1)                             .55          [illegible]
ACL-Defensiveness (X16)             .61          [illegible]
FDCT-Asymmetry (X4)                 .64          [illegible]

Y = Creativity criterion.
* p < .05.
** p < .01.
mal test batteries. Thus, the BI was unexcelled in predicting adoles-
cent creativity for both sexes and across diverse specialty fields.
The present results support the conclusion of Taylor (1962) that 
the biographical inventory is the most valid instrument for the 
prediction of creativity against an outside criterion. The results 
also uphold the general observation (Henry, 1966; Freeberg, 1967) 
that the biographical inventory has been found to be the best 
single predictor of complex criteria. The ability of biographical 
information to predict multi-faceted criteria is not surprising in 
view of the report by Taylor and Ellison (1963) that a factor 
analysis of a biographical inventory revealed that it was made 
up of some 20 to 30 relatively independent dimensions. It has 
often been suggested that among the multiple factors assessed by 
biographical information are important motivational traits not 
measured by other instruments, such as drive and perseverance. 

The general validity of biographical data is also related to the 
axiom that the best way to predict future behavior is to study 
past behavior of a related nature. In this regard, biographical 
instruments are typically constructed to measure past experiences 
within specific behavior domains. For example, Brenner (1968) 
found that industrial job absenteeism was significantly related to 
high school absenteeism; Holland and Nichols (1964) report that 
extracurricular achievement in college predicts extracurricular ac- 
tivity in professional life; and Schaefer (1963) noted that study 
habits in high school predicted academic success in college. Simi- 
larly, in the present study, the extent of childhood creative ac- 
tivities was found to be one of the most valid biographical items 
in the prediction of adolescent creative achievement. 

Conclusion. In conclusion, the general validity of the BI in the 
present investigation indicates that creative adolescents have com- 
mon background characteristics, the measurement of which can be 
used effectively in the identification and development of creative 
potential. 

Summary. The validity of a Biographical Inventory in identify-
ing several groups of creative adolescents was compared with a
variety of other measures. The results of the multiple regression 
analyses indicated that the Biographical Inventory was the most 
valid instrument for the prediction of adolescent creativity across 
sex and specialty fields. 


A NOTE ON CRITERION CONTAMINATION IN THE
VALIDATION OF BIOGRAPHICAL DATA¹


GARY B. BRUMBACK 
Department of Health, Education, and Welfare 


The purpose of this brief paper is to suggest that three ex-
traneous factors may operate to contaminate biographical data
validity coefficients based on rating criteria: (a) the familiarity
of raters with biographical characteristics of the ratees, (b) raters’
stereotypes about biographical correlates of criterion behavior, and
(c) the faking of answers to biographical items so that the back-
grounds of people stereotyped as successful are simulated.

While it may be quite obvious to the reader that these factors 
could indeed distort validity findings, to the writer’s knowledge 
no public mention has been made of the possibility. In reviewing 
more than 200 published and unpublished articles and documents 
dealing with biographical data, only one study was noted, and it 
dealt primarily with the problem of fakability (Klein and Owens, 
1965). 

The Factor of Familiarity. How the familiarity factor could
operate during the validation of a biographical inventory against
supervisory ratings of criterion performance can be illustrated
as follows. Suppose the supervisors know the college backgrounds of
their subordinates. Further, if these supervisors happen to believe
that the Eastern colleges yield the most creative graduates, the
criterion ratings given any subordinates from such colleges may
be extraneously higher than those given subordinates from less
valued colleges. Any scoring key for the biographical inventory
will then surely contain an item on applicants’ alma mater if only
because the supervisors’ knowledge of their subordinates’ back-
grounds mediated the stereotyped ratings. To be sure, that par-
ticular item may in addition be intrinsically related to the cri-
terion.

¹ This article was written by the author in his private capacity. No official
support or endorsement by the Department of Health, Education, and Welfare

In the broader context of criterion contamination, the problem 
of raters being illicitly privy to predictor information has been 
raised before (Bellows, 1941; Brogden and Taylor, 1950). Bellows
(1941, p. 501) states that this source of contamination “may be 
defined as any influence of knowledge of predictor information on 
the part of those who determine or contribute to the determina- 
tion of criterion scores of subjects of the investigation upon criterion 
data, and as a result upon indices of validity of predictors.” Bellows 
believes that such contamination can be avoided simply by pre- 
venting even those remotely involved in determining criterion 
scores from access to predictor information. 

Bellows does not mention that criterion raters are likely to be 
familiar with ratees’ biographical characteristics because of the 
visibility of some of the characteristics and because of the social 
psychology of the rater-ratee situation. These two conditions can- 
not be removed as though they were test scores to be cached until 
the validation process is completed. Small group theory and re- 
search suggest that a member’s familiarity with the backgrounds 
of other members of the group is a natural if not necessary phenom-
enon in interpersonal relationships (e.g., Zaleznik and Moment,
1964; Ziller and Behringer, 1965). 

The effect of rater familiarity probably cannot be minimized 
simply by excluding questions about conspicuous elements of ratees’ 
backgrounds for two reasons. First, the writer knows of few bio- 
graphical items calling for inconspicuous elements that appear in 
criterion keys. Second, the more obscure the element the easier it is 
to be forgotten or distorted, and the more difficult it is to be 
verified. 

The Factor of Stereotyped Ratings. It is known that there are
certain characteristics of the ratee which, even though they may 
be extrinsic to the behavior being rated, do influence ratings of 
that behavior. Two of these characteristics, for example, are age 
(Marrow and French, 1946) and race (De Jung and Kaplan, 
1962). It does not seem unreasonable to expect that many of the
background elements solicited in biographical inventories may have 
similar biasing effects upon raters. 

The circumstances surrounding the validation process would pre-
sumably modify any contamination from such stereotypes. For
example, the inflationary effect upon a validity coefficient of stereo- 
typed ratings should be greater if there is greater consensus of 
stereotypes among the raters, and/or if there is a smaller ratio of 
raters to ratees. On the other hand, a validity coefficient may be 
a function of deflating effects as well if there are any stereotypes 
which influence recruiting and hiring practices prior to the valida- 
tion process. As Myers and Evrett (1959) have demonstrated, bio- 
graphical items which significantly differentiate between accepted 
and rejected applicants generally fail to appear on subsequent cri-
terion keys. Any net inflationary effect should therefore probably
be greater when there is less preselection involved. 
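
The point about rater consensus can be made concrete with a small simulation. The sketch below is illustrative only: the sample size, the size of the stereotype effect (`bias`), the noise model, and the function names are all assumptions introduced here, not quantities reported in this note. A biographical item is generated independently of true performance, a proportion `consensus` of the ratings is raised for ratees who possess the item, and the item-criterion correlation is then computed.

```python
import math
import random


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def contaminated_validity(consensus, bias=1.0, n=20000, seed=0):
    """Correlation between a biographical item and a rating criterion.

    The item is unrelated to true performance by construction, so any
    correlation that appears is produced by the stereotype alone.
    `consensus` is the hypothetical proportion of ratings made by
    raters who hold the stereotype; `bias` is the hypothetical rating
    bonus such raters give to ratees who have the item.
    """
    rng = random.Random(seed)
    item, rating = [], []
    for _ in range(n):
        has_item = 1 if rng.random() < 0.5 else 0   # e.g., a valued alma mater
        performance = rng.gauss(0.0, 1.0)           # true criterion behavior
        noise = rng.gauss(0.0, 1.0)                 # ordinary rating error
        stereotyped = rng.random() < consensus      # rater holds the stereotype?
        bonus = bias if (stereotyped and has_item) else 0.0
        item.append(has_item)
        rating.append(performance + noise + bonus)
    return pearson(item, rating)


if __name__ == "__main__":
    for c in (0.0, 0.2, 1.0):
        r = contaminated_validity(c)
        print(f"consensus={c:.1f}  item-criterion r={r:.3f}")
```

Under these assumptions the item shows essentially no validity when no rater holds the stereotype, and a clearly positive "validity" when all raters do, even though the item never influences performance itself.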

The Factor of Fakability. Biographical inventories have been 
criticized for their vulnerability to faking by respondents (Bech- 
toldt, 1951). As the findings from studies by Keating, Paterson, 
and Stone (1950) and Mosel and Cozan (1952) suggest, questions 
about biographical characteristics which are verifiable through in- 
dependent records are less subject to distortion. Presumably, only 
cavalier applicants would risk posing as alumni of Eastern colleges 
if the company prized such graduates. Although the prospect of 
employing persons who falsified application information should 
alarm hiring officials, it is interesting to note that even verifiable 
items cannot be protected from the faking that is aided and abetted 
by hiring personnel who covet attractive applicants (Hughes, Dunn, 
and Baxter, 1956). 

Fakability vis-à-vis validity distortion is much less likely to be
a potential problem than the two other factors. Obviously the effect 
of stereotyped ratings mediated by the familiarity factor can occur 
regardless of whether faking has occurred. But it is conceivable 
that deceiving applicants will as subsequent employees carry their 
subterfuge into the work setting and eventually into a predictive 
validation situation. Thus, supervisors may become familiar with 
what they believe to be bona fide biographical characteristics,
and these subsequently influence the criterion ratings. 

Concerned about the fakability of life history forms which in- 
clude transparent items, Klein and Owens (1965) compared life 
history responses scored on a rating criterion key and on a key de- 
rived from what was regarded as an objective measure, number of 
patent disclosures. Because more respondents under response-faking
sets could be classified as predicted successes by the rating key 
than by the patent-disclosure key, it was concluded that the sub- 
jective rating criterion was more conducive to faking. It was then 
assumed by the investigators that the greater vulnerability of the
rating key could be attributed to some kind of stereotype operating 
when the key was developed in a previous study (Smith, Albright, 
Glennon, and Owens, 1961). 

Summary. Psychologists and other specialists often turn to bio- 
graphical data in talking about and working on assessment and 
prediction problems. Much of the glamor of biographical data 
techniques stems undoubtedly from validity data reported in the 
literature (e.g., England, 1961; Guion and Gottier, 1965). But
in those situations where criterion ratings are used, the criterion 
may be contaminated by the raters’ natural familiarity with the 
ratees’ biographical characteristics. The familiarity factor, of course, 
will mediate any effect upon ratings of the raters’ stereotypes about 
biographical characteristics and the ratees’ simulation of character- 
istics which had been faked at the time of application for em- 
ployment. 

Preferably, the developer of a biographical inventory should be 
able to demonstrate that the validity of his instrument is free 
from contamination by isolating and testing the factors discussed 
here in an experimental design or by statistically controlling them. 
Or if one wished to guard against the assumed effects of the first 
two factors, for example, then criterion raters should be selected 
who know relatively little about the ratees’ backgrounds, or who 
hold no stereotypes about the biographical correlates of criterion 
performance. 


REFERENCES 


Bechtoldt, H. P. Selection. In S. S. Stevens (Ed.), Handbook of Ex- 
perimental Psychology. New York: Wiley, 1951, pp. 1237-1266. 

Bellows, R. M. Procedures for Evaluating Vocational Criteria. Jour-
nal of Applied Psychology, 1941, 25, 499-513.

Brogden, H. E. and Taylor, E. K. The Theory and Classification of
Criteria Bias. EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT,
1950, 10, 159-186.

De Jung, J. E. and Kaplan, H. Some Differential Effects of Race of
Rater and Ratee on Early Peer Ratings of Combat Aptitude.
Journal of Applied Psychology, 1962, 46, 370-374.




England, G. W. Development and Use of Weighted Application 
Blanks. Dubuque, Iowa: Brown, 1961. 

Guion, R. M. and Gottier, R. F. Validity of Personality Measures in
Personnel Selection. Personnel Psychology, 1965, 18, 135-164.

Hughes, J. F., Dunn, J. F., and Baxter, B. The Validity of Selection
Instruments under Operating Conditions. Personnel Psychology,
1956, 9, 321-324.

Keating, E., Paterson, D. G., and Stone, C. H. Validity of Work 
Histories Obtained by Interview. Journal of Applied Psychol- 
ogy, 1950, 34, 6-11. 

Klein, S. P. and Owens, W. A. Faking of a Scored Life History Blank
as a Function of Criterion Objectivity. Journal of Applied Psy- 
chology, 1965, 49, 452-454. 

Marrow, A. J. and French, J. R. P., Jr. Changing a Stereotype in 
Industry. Personnel, 1946, 22, 305-308. 

Mosel, J. L. and Cozan, L. W. The Accuracy of Application Blank
Work Histories. Journal of Applied Psychology, 1952, 36, 365-

Myers, J. H. and Evrett, W. The Problem of Preselection in 
Weighted Application Blank Studies. Journal of Applied Psy- 
chology, 1959, 43, 94-95. 

Smith, W. J., Albright, L. E., Glennon, J. R., and Owens, W. A. The 
Prediction of Research Competence and Creativity from Per- 
sonal History. Journal of Applied Psychology, 1961, 45, 59-62. 

Zaleznik, A. and Moment, D. The Dynamics of Interpersonal Be-
havior. New York: Wiley, 1964.

Ziller, R. C. and Behringer, R. D. Motivational and Perceptual Ef- 
fects in Orientation toward a Newcomer. Journal of Social Psy- 
chology, 1965, 66, 79-90. 


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
1969, 29, 445-451. 


MEASURES OF ACHIEVING TENDENCY


ALBERT MEHRABIAN 
University of California, Los Angeles 


MEHRABIAN (1968) reported findings relating to scales of achieve-
ment constructed for males and females. The scales were designed
to discriminate high achievers (i.e., individuals whose motive to
achieve is stronger than their motive to avoid failure) from low
achievers (i.e., individuals whose motive to avoid failure is stronger
than their motive to achieve). The present study provides additional
validity data relating to revised versions of these scales. In this
study the revised scales, which consisted of 26 items each, were
administered to subjects along with measures of affiliation, achieve-
ment, social desirability, test anxiety, dogmatism, and neuroticism.
In addition, a series of questions was administered to provide
information about developmental antecedents of high versus
low achieving tendencies and socialization patterns.

The following hypotheses were based on existing findings in the 
area of achievement, such as those reviewed by Byrne (1966). 

It was hypothesized that high achievers exhibit less test anxiety, 
are less dogmatic, less neurotic, and less conforming than low 
achievers. 

Method—Subjects. There were 114 male and 98 female University 
of California undergraduates who were paid to participate in the
study.

Procedure. The following series of personality scales was ad-
ministered to each subject in a group setting: the Mehrabian
(1968) Achievement Scales with eight items deleted from each
such that there were only 26 items in both the male and female
scales (the revised male scale correlated 0.94 with the male
scale reported by Mehrabian, 1968, and the revised female scale
correlated 0.92 with the original female scale); the Task Orienta-
tion Scale of Bass (1967), which exhibits properties similar to
those of an achievement scale; an achievement scale devised by
Jackson (1967); the Guilford and Zimmerman (1949) Sociability
Scale and the Liverant (1958) Social Love and Affection Scale;
the Mandler and Sarason (1952) Test Anxiety Questionnaire; the
Eysenck and Eysenck (1963) Neuroticism Scale; Rokeach’s (1960)
Dogmatism Scale; Cattell and Eber’s (1957) Factor H (a measure
of venturesomeness in contrast to shyness); and finally, the Crowne
and Marlowe (1960) Social Desirability Scale. In addition, birth
order information and questions relating to life history and sociali-
zation patterns were included. The questions were as follows:


1a. My parents pretty much left me alone and led me to rely 
on my own resources to play, learn, etc. 

1b. My parents strongly rewarded my “good” and punished
my “bad” behaviors. 

2. I socialize with (a) people who are socially of a lower status 
than myself, (b) people who are socially of the same status 
as myself, (c) people who are socially of a higher status
than myself. 

3. I tend to conform more to the demands of (a) someone 
who is socially of a lower status than myself, (b) someone 
who is socially of the same status as myself, (c) someone 
who is socially of a much higher status than myself. 

4. Among individuals who are socially my equal, I usually 
associate with those (a) just like myself in attitudes and 
way of life, (b) very slightly different from myself in at- 
titudes and way of life, (c) slightly different from my- 
self ..., (d) moderately different from myself... , (e) 
very different from myself ... , (f) extremely different 
from myself... . 


Each subject received a folder containing all of the above tests, 
and additional pages on which to record his answers. The tests
were administered to groups of approximately 50 to 60 subjects, 
who generally took about two hours to complete the tasks. Re- 
sponses to all of the test items, including the above questions, 
were obtained in accordance with the following instructions. 

The questions which are listed below consist of two or more
statements which are designated “a,” “b,” “c,” etc. We would
like you to read all the statements of each question first, and
then go back and rate each of the statements using the follow-
ing scale.


A scale ranging from +4 (very strong agreement) to —4 (very 
strong disagreement) was inserted here. In other words, for most
of the tests, instead of obtaining the usual dichotomous decision, 
such as true-false or a preference between two items, each subject 
indicated a score on a nine-point scale for each alternative. This 
helped make the procedure more uniform. 

Results and Discussion. Results obtained with the male scale 
will be considered first. With df = 112, correlation coefficients 
in excess of 0.19 are significant at the .05 level. The findings in- 
dicated that the revised Mehrabian Achievement Scale correlated 
significantly with both of the achievement scales employed in the 
study—namely a 0.20 correlation with the Bass (1967) Task 
Orientation Scale, and a 0.62 correlation with the Jackson (1967) 
Achievement Scale. The stronger relationship exhibited with the 
Jackson scale seems reasonable since that scale was especially de- 
vised to measure achievement, whereas the Bass scale is only 
tangentially a measure of achievement. Since the Mehrabian 
Achievement Scale was designed to be independent of affiliative 
tendencies, it was reassuring to find that it did not consistently 
exhibit inverse correlations with affiliation: a correlation of —0.24 
was obtained with the Liverant (1958) Social Love and Affection 
Scale whereas a correlation of 0.23 was obtained with the Guilford
and Zimmerman (1949) Sociability Scale. 
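
The significance thresholds quoted for the correlations can be checked from the t distribution. The sketch below is an independent check, not the computation used in the study: it assumes a two-tailed test, finds the critical t by bisection on a numerically integrated t density, and converts it with r = t / sqrt(t² + df). The function names are introduced here for illustration.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0, using Simpson's rule on the interval [0, x]."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def critical_r(df, alpha=0.05):
    """Smallest |r| that is significant at level alpha (two-tailed)."""
    lo, hi = 0.0, 10.0
    for _ in range(60):                      # bisection for the critical t
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < 1 - alpha / 2:
            lo = mid
        else:
            hi = mid
    t = (lo + hi) / 2
    return t / math.sqrt(t * t + df)


if __name__ == "__main__":
    for df in (112, 96):
        print(f"df={df}: critical r = {critical_r(df):.3f}")
```

Under these assumptions the critical values come out at roughly .18 for df = 112 and .20 for df = 96, in line with the thresholds stated in the text for the male and female samples.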

As expected, the male scale correlated -0.26 with test anxiety
as measured by the Mandler and Sarason (1952) Test Anxiety
Questionnaire, and also correlated -0.40 with the Eysenck and
Eysenck (1963) Neuroticism Scale, thus indicating, in accordance
with the hypothesis, that high tendencies to achieve tend to be
infrequently associated with neurotic or anxious traits.

The male scale also exhibited inverse relationships with dogmatic
tendencies, such as a correlation of -0.25 with Rokeach’s (1960)
Dogmatism Scale, as well as additional evidence for low dogmatic
tendencies based on inverse correlations of self-report measures of
association with other people who have attitudes which resemble
too closely the subjects’ own attitudes. Thus, the following correla-
tions were obtained: -.27 with “I usually associate with persons
just like myself in their attitudes and way of life,” -.24 with
“... persons very slightly different from myself,” and -.29 with
“... persons slightly different from myself.” Also according to the
Mehrabian Achievement Scale, high achievers prefer to associate
more with high status others than do low achievers, as exhibited
by a correlation of .20 with “I socialize with people who are
socially of a higher status than myself.” However, there were no
significant correlations with willingness to socialize with people of
the same status or of a lower status. In terms of the findings re-
lating to conformity, high achievers, in terms of self-report, seem
to be less willing to conform to the demands of others who are
socially of an equal or of a lower status, as exhibited by a -0.27
correlation with “I tend to conform more to the demands of
someone who is socially of the same status as myself,” and -0.21
with “I tend to conform more to the demands of someone who is
socially of a lower status than myself.” In sum, then, high achievers
report being more willing to associate with high status targets
and being less susceptible to influence from same status and low
status persuaders.

It is interesting to note that a correlation of .25 was obtained 
between the male achievement scale and the responses to the ques- 
tion, “My parents pretty much left me alone and led me to rely
on my own resources ... ,” thus indicating that perhaps indepen-
dence training is a partial determiner of high achieving tendencies
in males. Furthermore, none of the three achievement scales em- 
ployed correlated significantly with birth order, a variable which 
has occasionally been found to relate to affiliative tendencies, thus 
indicating that birth order is perhaps not a significant determiner 
of achieving tendencies. 

Finally, the male scale correlated 0.24 with the Crowne and
Marlowe (1960) Social Desirability Scale. For comparison, the 
corresponding correlation between the Bass Task Orientation Scale 
and social desirability was 0.11, and that between the Jackson 
Achievement Scale and social desirability was 0.44. Incidentally, 
it is interesting to note that the use of 9-point rather than true-
false versions of the Crowne and Marlowe scale seems to con-
siderably enhance that scale, since the original item analysis of the
Mehrabian scale employed a true-false version of the Crowne and 
Marlowe scale which was used to eliminate items which correlated 
with the latter. However, the 0.24 correlation even with this more 
sensitive version of the Crowne and Marlowe scale was sufficiently 
low as not to be disconcerting. 

Finally, the male scale correlated 0.56 with the Cattell and Eber 
(1957) Shy-Venturesome Scale, with the high achievers scoring 
on the more venturesome end. A venturesome person is defined as 
being “sociable, bold, ready to try new things, spontaneous, and 
abundant in emotional response. His thick-skinnedness enables him 
to face wear and tear in dealing with people and grueling emo- 
tional situations, without fatigue. However, he can be careless of 
detail, ignore danger signals, and consume much time talking. 
He tends to be pushy and actively interested in the opposite sex” 
(Cattell and Eber, 1957, p. 15). 

In comparison to the male scale, fewer relations were obtained 
with the Mehrabian Female Achievement Scale. With df = 96, 
correlations over .20 are significant at the .05 level. As in the 
case of the male scale, the female scale also correlated with both 
of the other achievement tests employed in the study. Thus, a
correlation of .30 was obtained with the Bass Task Orientation 
Scale, and a correlation of .37 was obtained with Jackson's Achieve- 
ment Scale. No significant correlations were obtained with the two 
affiliation measures, namely, the Guilford and Zimmerman Soci- 
ability Scale and the Liverant Social Love and Affection Scale. 
A correlation of 0.34 was obtained with the Cattell and Eber 
Venturesomeness scale. 

In terms of relationships of the female scales to socialization
patterns, once again a significant positive correlation of .22 was
obtained with “I socialize with people who are socially of a higher
status than myself.” However, no significant correlations were ob-
tained with responses to the questions relating to conformity and
the status of the person conformed to, with Rokeach’s Dogmatism Scale,
or with affiliation with targets whose attitudes were discrepant from
or similar to those of the subjects. Finally, relating to the
developmental antecedents, once again no significant correlations
were obtained with birth order data. However, a positive cor-
relation of .23 was obtained with responses to “My parents strongly
rewarded my ‘good’ and punished my ‘bad’ behaviors.” The latter
suggests that perhaps high achieving tendencies in females


Thus, it may be hypothesized that whereas achievement in males
tends to be relatively more determined by self-sufficiency and in-
dependence from existing norms, achievement in females seems
to be more determined by the acceptance and perpetuation of
existing norms rather than the creation of new ones.

Finally, the female achievement scale was found to correlate
.34 with the Crowne and Marlowe Social Desirability Scale. This
figure compares favorably with the Crowne and Marlowe scale’s
correlation of .40 with the Bass Task Orientation Scale, and
with the Jackson Achievement Scale.

In sum, both the male and female scales were found to correlate
significantly with existing achievement and shy-venturesomeness
scales and to be orthogonal to affiliation scales. Both scales bore
satisfactorily low correlations with social desirability, although
that of the male scale was lower than that of the female scale.
Finally, the male scale also exhibited inverse relationships to test
anxiety and neuroticism, and neither of the scales exhibited signifi-
cant correlations with birth order. In responses to self-report mea-
sures of socialization and conformity, both male and female scales
were found to correlate positively to a tendency to desire more
socialization with high status targets, and only the male scale
reflected a negative correlation with tendency to conform to same
or low status others. The male scale also indicated that a high
tendency to achieve tended to be inversely correlated with dogma-
tic traits. But whereas high achieving males were found to be
relatively less dogmatic than low achieving males, a corresponding
difference was not found for females. This, together with


The present data together with validity data reported by Mehra-
bian suggest that the revised scales may be of some use in studies
which attempt to relate achieving tendencies to other
attributes or social behaviors, and may be of particular use
in studies which employ Atkinson's (1964) conceptualization of
achievement.

Summary. This study provides validational data relating to male
and female scales of achievement reported by Mehrabian (1968).
Revised versions of the scales significantly correlated with two
other measures of achievement and, as expected, did not con-
sistently relate to measures of affiliation. The male scale correlated
inversely with measures of neuroticism, test anxiety, and dogma-
tism. The study also provides some information about the char-
acteristic differences in the achieving tendencies of males and fe-
males.


Q TECHNIQUE FACTOR ANALYSIS OF THE ROKEACH
DOGMATISM SCALE 


RUTLEDGE L. JAY 
The University of Arizona 


ROKEACH, in studying the structural organization of belief sys-
tems, distinguishes between the open and the closed mind. Accord-
ing to a summary of Rokeach’s work (Loree, 1965), the person
with the closed belief-disbelief system would tend to agree with
certain belief statements. For example: “To compromise with our
political opponents is dangerous because it usually leads to the
betrayal of our own side,” and “There are two kinds of people in
this world: those who are for the truth and those who are against
the truth.”

An assumption underlying the dogmatism scale is that the closed
belief-disbelief system is a unitary characteristic. The method of
scoring, in which the assigned scale values are converted to positive
numbers by adding the constant four to each value and summing
to a score total, also assumes a unitary basis for the belief-disbelief
system. However, this assumption may be questioned by means
of Q technique factor analysis, a useful technique for examining
empirical data in order to determine whether different groups of
individuals differ in their patterns of responses to a set of examples
of a domain of study.
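
The scoring rule just described can be written out in a few lines. This is a minimal sketch under the conventions stated elsewhere in this paper (ratings from -3 to +3 with zero not permitted, 40 items); the function name and the input check are introduced here for illustration.

```python
ALLOWED_RATINGS = {-3, -2, -1, 1, 2, 3}   # zero is not a permitted rating


def dogmatism_total(ratings):
    """Total score: each rating is shifted by the constant four, then summed."""
    for r in ratings:
        if r not in ALLOWED_RATINGS:
            raise ValueError(f"rating {r!r} is outside the allowed values")
    return sum(r + 4 for r in ratings)


if __name__ == "__main__":
    print(dogmatism_total([-3] * 40))   # lowest possible total on 40 items
    print(dogmatism_total([3] * 40))    # highest possible total on 40 items
```

On 40 items the total therefore ranges from 40 to 280, and a person's total equals 40 times (mean rating + 4). The single total discards the pattern of ratings across items, which is the information the Q technique analysis examines.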
In an analysis of response patterns to such stimuli as the Rokeach
dogmatism scale, it is necessary to apply such a complex method
as Q technique factor analysis because of the need to see the trees
in a forest. Because clusters of individuals are not readily ap-
parent from the data, a direct interpretation cannot be made.
Therefore, factor analysis seems necessary to determine whether
the dogmatism scale is in fact a general factor scale.

A dogmatism scale (Rokeach, 1960) was administered
to 29 students in an introductory course in educational psychology
during the fall semester 1967. These students were 20-21 years of 
age, male and female, juniors, and seniors. The dogmatism scale 
consisted of 40 statements concerning which the subjects expressed 
their points of view on a 6-point scale. The socioeconomic level 
of the subjects was determined by the Warner scale of social class. 
The female students tended to be characterized by their own self- 
ratings as upper middle class; the male students tended to be lower 
middle or upper lower class. 

The method of correlation of persons was Pearson product-mo- 
ment correlation. An intercorrelation matrix, as might be ex- 
pected, showed that in general the correlations between persons 
were not very high, the range being from —0.05 to 0.70. A factor 


TABLE 1
Means and Standard Deviations of Persons
(No. of Persons: 29; No. of Items: 40)

Person     Mean     Std. Dev.
   1       0.425     2.159
   2      -0.025     2.247
   3      -0.525     1.811
   4       0.350     2.646
   5      -0.725     1.768
   6      -0.025     1.731
   7       0.300     2.003
   8       0.250     2.362
   9      -0.300     2.388
  10      -0.600     1.780
  11      -0.800     1.897
  12      -0.325     2.129
  13      -0.900     2.228
  14       1.275     2.038
  15      -0.025     2.224
  16      -0.700     2.127
  17       0.550     2.480
  18      -0.075     2.314
  19      -0.850     1.875
  20      -1.325     1.873
  21      -1.050     1.395
  22      -1.000     1.867
  23      -0.975     1.761
  24       0.225     2.304
  25       0.075     2.188
  26      -0.325     1.716
  27       0.125     1.727
  28      -0.825     2.147
  29                 1.803


TABLE 2
Rotated Factor Matrix

                    I        II       III
% of Variance     6.703    4.492    2.542
Cumulative %      6.703   11.195   13.737

Person      I       II      III     Communality
   1      0.03    0.28    0.65        0.50
   2      0.56    0.18   -0.05        0.35
   3      0.68    0.16    0.16        0.51
   4      0.04    0.67    0.12        0.47
   5      0.51   -0.11    0.34        0.39
   6      0.27    0.18    0.66        0.55
   7      0.61    0.28    0.02        0.45
   8      0.62    0.25    0.12        0.46
   9      0.44    0.38    0.25        0.40
  10      0.02    0.53    0.29        0.37
  11      0.33    0.42    0.29        0.36
  12      0.38    0.38    0.25        0.36
  13      0.12    0.01    0.76        0.59
  14      0.05    0.64    0.07        0.42
  15      0.31    0.48    0.29        0.41
  16      0.72    0.25    0.29        0.67
  17      0.28    0.43   -0.31        0.37
  18      0.35    0.70   -0.01        0.62
  19      0.77   -0.06    0.12        0.61
  20      0.50    0.49    0.06        0.49
  21      0.12    0.55   -0.02        0.32
  22      0.66   -0.04    0.39        0.60
  23      0.58    0.51   -0.22        0.65
  24      0.58    0.44    0.21        0.58
  25      0.54    0.30    0.03        0.38
  26      0.20    0.43    0.20        0.26
  27      0.56    0.18    0.21        0.39
  28      0.64    0.49    0.03        0.65
  29      0.73    0.21   -0.04        0.58


matrix then was computed to show the factor loadings for the
29 persons on each of the 16 factors that were extracted.
Table 1, Means and Standard Deviations of Persons, displays
the individual differences in rating behavior of the individuals in
the sample with respect to the Rokeach dogmatism scale items.
Although the scale is scored as a 7-point scale, the method of
rating used was to assign values between +3 and -3 to each
item. No value of zero was permitted. Such a value would cor-
respond to a value of four on a 7-point scale.
Table 2, Rotated Factor Matrix, displays the factor loadings for
the three-factor solution chosen for interpretation. The communali-
ties, ranging from 0.67 to 0.26, indicate an attempt to interpret
a domain in which reliabilities are rather low. This study, of
course, must be understood to be principally suggestive.
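
As a check on how the communalities in Table 2 relate to the loadings, the sketch below recomputes a few of them as sums of squared loadings across the three rotated factors. It also converts the figures printed in the "% of Variance" row into shares of total variance, on the assumption (made here, not stated in the paper) that those figures are column sums of squared loadings and that total variance equals the number of persons, 29. The function names are illustrative.

```python
N_PERSONS = 29

# Loadings on Factors I, II, III for three persons, copied from Table 2.
LOADINGS = {
    1: (0.03, 0.28, 0.65),
    2: (0.56, 0.18, -0.05),
    26: (0.20, 0.43, 0.20),
}

# Figures printed in the "% of Variance" row of Table 2.
PRINTED_VARIANCE_ROW = (6.703, 4.492, 2.542)


def communality(loadings):
    """Sum of squared loadings of one person across the retained factors."""
    return sum(value * value for value in loadings)


def share_of_total(sum_of_squares, n_variables=N_PERSONS):
    """Percentage of total variance, assuming standardized variables."""
    return 100.0 * sum_of_squares / n_variables


if __name__ == "__main__":
    for person, row in LOADINGS.items():
        print(f"person {person}: communality = {communality(row):.2f}")
    for label, value in zip(("I", "II", "III"), PRINTED_VARIANCE_ROW):
        print(f"Factor {label}: {share_of_total(value):.1f}% of total variance")
```

The recomputed communalities agree with the printed column to two decimals. If the assumption about the variance row holds, the three factors would account for about 23.1, 15.5, and 8.8 percent of the total variance, or roughly 47 percent together.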

Factor rotation and factoring. The method of factoring was a 
principal-components analysis which extracts factors such that each 
explains the maximum possible variance. This usually results in a 
large first factor and rapidly declining eigenvalues. Factors were 
extracted until the eigenvalue fell below 0.5 (a program option). 

As many factors should be rotated as can be clearly interpreted 
and which will explain as much of the total variance as possible. 
No factor with an eigenvalue of less than 1.00 was rotated. A 
factor is defined by those persons with the highest loadings. 

Factor interpretation. Sixteen factors were removed from the 
correlation matrix. However, the most meaningfully interpretable 
number of factors appeared to be three. Therefore, a rotation of 
three factors to simple structure was chosen. At this point, prior to 
factor interpretation, it is important to note that in a Q technique 
factor analysis, the factors are persons; in fact, each is a hypotheti- 
cal person defined by the loadings of real persons on the factor. 
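
The sense in which persons, rather than items, are the variables in a Q technique analysis can be illustrated with the correlation step. The sketch below correlates persons with one another across items using the Pearson product-moment formula. The ratings are hypothetical values invented for the example, and the function names are introduced here.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def person_correlations(ratings):
    """Person-by-person correlation matrix; ratings[p][i] is person p's rating of item i."""
    n = len(ratings)
    return [[pearson(ratings[a], ratings[b]) for b in range(n)] for a in range(n)]


# Hypothetical ratings of six items by three persons (not data from the study).
RATINGS = [
    [3, 2, -1, -3, 1, -2],
    [3, 1, -1, -3, 2, -2],
    [-3, -2, 1, 3, -1, 2],
]

if __name__ == "__main__":
    for row in person_correlations(RATINGS):
        print("  ".join(f"{value:6.3f}" for value in row))
```

In the study the corresponding matrix would be 29 by 29, and it is this matrix that is factored, so that each factor groups persons with similar response patterns.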

For the purpose of arriving at a factor interpretation, the follow- 
ing criteria were established: (1) only those stimulus statements 
were considered in which all three persons with the highest loadings 
on a factor agreed exactly in scale value assigned to a stimulus 
statement; (2) after this, a further logical analysis was made, this 
time with the criterion that at least two scale values must be 
identical and the third not more than one scale position different. 
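
The two agreement criteria can be stated as simple predicates over the ratings that the three highest-loading persons gave to one statement. The sketch below is one reading of those criteria, not the author's procedure. In particular, it assumes that "one scale position" is counted on the six permitted rating values, so that -1 and +1 are adjacent because zero was not allowed. The function names are introduced for illustration.

```python
def scale_position(rating):
    """Map the permitted ratings -3..-1, +1..+3 onto positions 1..6."""
    if rating not in (-3, -2, -1, 1, 2, 3):
        raise ValueError(f"rating {rating!r} is not a permitted value")
    return rating + 4 if rating < 0 else rating + 3


def perfect_agreement(ratings):
    """Criterion 1: all three persons assigned exactly the same scale value."""
    return len(set(ratings)) == 1


def near_agreement(ratings):
    """Criterion 2: two identical values, the third at most one position away."""
    low, mid, high = sorted(scale_position(r) for r in ratings)
    return (low == mid and high - mid <= 1) or (mid == high and mid - low <= 1)


if __name__ == "__main__":
    for triple in ([3, 3, 3], [3, 3, 2], [3, 3, 1], [-1, -1, 1], [1, 2, 3]):
        print(triple, perfect_agreement(triple), near_agreement(triple))
```

Every statement that meets the first criterion also meets the second, so the second analysis widens the set of statements available for interpreting each factor.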

Items #12, #22, and #34, elicited perfect agreement by all three 
persons in assigning the highest scale value to each. Items #8, #26,
#28, #31, and #40 showed perfect agreement among the persons 
in assigning the lowest scale value to each. In fact, these extreme 
scale valued items are the only stimuli for which complete agree- 
ment was achieved. 

Factor I (based on perfect agreement of raters) was interpreted 
as encompassing open-mindedness in a variety of areas, including 
concern for other people and respect for their ideas. Factor II
(based on perfect agreement) was interpreted as representing a
value complex based on fear of the future and the helplessness of
man. Factor III (based on perfect agreement) was interpreted as
identifying a person who perceives others as needing reform or
enlightenment.




RUTLEDGE L. JAY 457 


Factor I (based on two identical scale values and the third no 
more than one scale value different) seems to be an open-minded 
person who accepts a broad range of different points of view and 
is unafraid of life. Further evidence of this interpretation comes 
from the additional fact that the mean total dogmatism score for
the persons high in this factor is 140.67 points, with a range of 25
points from highest to lowest and a standard deviation in total
dogmatism score of 11.56. This contrasts with a grand mean of 146.74
for the entire sample and a standard deviation of 24.16.

Factor II (based on two identical scale values and the third no 
more than one scale value different) seems to describe a fearful 
person with a need for group solidarity. This factor fits well into 
the concept of the believer in single causes and the closed mind. 

Factor III (based on two identical scale values and the third 
not more than one scale value different) seems to describe a 
generally authoritarian person who feels the need to convince others 
and to exercise his need for power. In addition, those who are high 
in this factor have total dogmatism scores that fall between the 
scores of Factor I persons and Factor II persons. 

The three persons characterizing Factor I have a lower mean 
dogmatism score than either of the clusters of persons constituting 
the other two factors. Furthermore, Factor I has less variability 
in terms of standard deviation and range.

The three persons with the highest loadings on Factor III have
a mean dogmatism score between Factor I and Factor II. However,
the standard deviation and range of scores of persons characterizing
Factor III are identical to those for Factor II.

The cluster of persons defining Factor II has a much higher 
mean dogmatism score than either Factor I or Factor III. The 
standard deviation and range of scores characterizing Factor II 
are the same as for Factor III.

Factor I is defined by persons #16, #19, and #29. Factor II
involves persons #4, #14, and #18. Factor III involves persons
#1, #6, and #13.

The Factor I cluster of persons was unanimous in general
disagreement with the belief-disbelief statements. Two of the three
persons in the Factor II cluster were in general agreement with
the statements. The persons defining Factor III tended
to center their ratings on the zero point of the scale.

In order to finalize the interpretation of the factors, Table 3,
Classification of Items of Dogmatism According to Measured Char- 
acteristics, was prepared. On the basis of this classification system, 
it was then possible to prepare Table 4, Factors in Terms of 
Measured Characteristics. 

On the basis of Table 4, and particularly in agreement with 
the first criterion for factor interpretation (unanimous agreement 
of all three persons with highest loading on each factor), it is 
concluded that the factors may be interpreted as follows: 


1. Factor I is the pattern of responses to the 40-item Rokeach 
dogmatism scale of open-minded, tolerant, nondogmatic individuals. 

2. Factor II is the pattern of responses of persons who have a 
profound and generalized fear of life. They may be characterized 
as true believers motivated by fear. 

3. Factor III is the pattern of responses of persons who believe
in one cause. They manifest coexistence of contradictions within
the belief system, a need for martyrdom, and intolerance toward
disbelievers. They reject aloneness and fear and the avoidance
of contact with the disbelief systems of others. The term
authoritarian personality (Adorno et al., 1950) may be applied to them.
They are true believers motivated by experiences of extreme re- 
jection or domination in childhood. 


Summary. A Q technique factor analysis was performed of the 
Rokeach dogmatism scale in order to question the assumption of a 


TABLE 3
Classification of Items of Dogmatism According to Measured Characteristics

Measured Characteristic                                    Item

1. Belief in positive and negative authority               11, 1, 5, 4, 19
2. Need for martyrdom                                      [illegible]
3. Belief in the one right cause                           21, 3, 6, 8, 9, 29, 35, 39,
                                                           40, 10, 16, 2, 12, 22
4. Intolerance toward the disbeliever                      26, 21, 25, 32, 14, 23
5. Aloneness, isolation, and helplessness of man
   are emphasized                                          27, 7, 31, 33, 38, 18
6. Accentuates differences between his own beliefs
   and alternative belief systems                          [illegible]
7. Coexistence of contradictions within the
   belief system                                           30, 17
8. Avoiding contact with the belief-disbelief
   systems of others                                       36, 15, 24, 37

TABLE 4
Factors in Terms of Measured Characteristics

Level of Agreement-                              FACTORS
Disagreement         I                        II                       III

+3 or 7              Desirable to reserve     5, 3, 7                  3, 3, 3
                     judgment                 Belief, fear,            Belief in one cause
                                              closed mind

+2 or 6              3, 5, 4, 3               3, 5, 5                  3, 7, 5, 7
                     Belief                   Belief, fear             Belief, contradiction

+1 or 5              1                        [illegible]              3, 1, 2, 4
                     Positive-negative        Belief                   Belief, + −, martyr,
                     authority                                         isolate disbeliever

−1 or 3

−2 or 2              −3, −1, −4, −3, −3, −8   −1, −3, −4, −5, −6,
                     Reject belief            −3, −5, −4, −5, −8
                                              Reject fear

−3 or 1              −3, −4, −6, −3, −5, −3   −2, −4, −4               −8, −4
                     Rejects beliefs          Rejects intolerance      Rejects
                                                                       Avoids contact,
                                                                       Intolerance


unitary basis for the belief-disbelief system. The response patterns 
of 29 persons, juniors and seniors enrolled in an introductory 
course in educational psychology at The University of Arizona in 
the fall semester 1967, were correlated, factored by the principal- 
components analysis method, and three factors were rotated to 
simple structure. The factors were interpreted as being (1) open-minded,
tolerant nondogmatists, (2) persons who have a profound and
generalized fear of life and are characterized as true
believers, and (3) authoritarian persons who are true believers.
The factor patterns, rather than a single global dogmatism score,
are the basis for classifying persons into groups.

THE SCRAMBLED SENTENCE TEST: A GROUP
MEASURE OF HOSTILITY


FRANK COSTIN 
University of Illinois, Urbana 


This paper describes the development of a disguised group test
of hostility. It was derived from Watson, Pritzker, and Madison's
(1955) individual test, which was administered by projecting on a
screen a set of four words arranged in a scrambled order. The
subject was asked to assemble a three-word sentence from this
set, his response being scored as either “hostile” or “neutral.” For
example, if from the scrambled set “Shoot I’ll you ask,” the subject
chose “I’ll shoot you,” his response would be “hostile”; if
he constructed the sentence “I’ll ask you,” his response would be
“neutral.” The entire test consisted of 60 sets of scrambled words,
presented one at a time in rapid succession. All responses
were sound recorded and scored at the conclusion of the test; the
subject’s total score was the number of hostile sentences he had
assembled. Assuming that their test favored the expression of

“repressed” impulses, Watson and his colleagues hypothesized
that neurotics would reveal more hostility than “normal” individuals.
The hypothesis was confirmed: the mean hostility score of
neurotic patients was significantly higher than that of a control
group of patients who had not sought treatment.

Subsequently, this scrambled sentence task was adapted for use
in several experimental studies of operant conditioning (Anderson,
1958; Scott, 1958), and also has been employed in a number of
situations during experiments involving antecedents and correlates
of hostility (Gillespie, 1961; Sarason, Ganzer, and Granger,
1965); apparently, however, no one has previously constructed a
paper-and-pencil version which can be rapidly administered to 
large groups, and whose reliability and validity have been dem- 
onstrated, although such an instrument could be valuable for a 
wide variety of studies. The Scrambled Sentence Test (SST) de- 
scribed in this report is intended to fill that gap. Although its 
development was based on the responses of college students, its 
format and content seem to be suitable for research with other 
kinds of populations as well. It should also be pointed out that 
unlike the authors of the original scrambled sentence task, the 
present investigator did not assume that the hostility which his 
SST was designed to elicit necessarily represented “repressed” im- 
pulses. 

Preliminary steps in constructing the SST. To establish initial 
guidelines two preliminary forms, A and B, were administered to 
classes in biology, psychology, and history. These classes consisted 
of 150 students from a variety of curricula and from all four 
undergraduate years. Each form contained 30 printed sets of scram- 
bled words roughly parallel in content; all were taken directly from 
the original individual test. 

Students were instructed to read each scrambled set of four 
words and to underline any three words which made a complete 
sentence; they were requested to do this according to their first 
impression, and to work rapidly. After they had completed both 
forms they were asked to write a brief statement concerning what 
they thought the test was measuring. (The printed directions simply 
said that it measured how people perceived word relationships.) 
About one-third of the students were also interviewed briefly to 
obtain additional reactions to the test. An analysis of these data 
led to the following changes: 


(a) Some of the items (i.e., sets of scrambled words) were elim- 
inated or rephrased. 

(b) New items were constructed, with more emphasis on con- 
tent directly relevant to undergraduate life. 

(c) Buffer items (sets of scrambled words containing no hostile 
assemblies) were added to make the purpose of the test less 
obvious, since about 70 per cent of the students had cor- 
rectly estimated its intent. 




(d) Items were matched to make the content of the two forms 
closely parallel. 

(e) Items in Form A were arranged in a random order. Items
in Form B were then arranged in the same order as their
counterparts in Form A.

(f) The order of the words within each set for Form A was
randomized. The order of words within each set for Form
B was then made the same as the corresponding set in Form
A.

(g) Each form now contained 50 items, 30 to be scored for hostile
or non-hostile sentences, and 20 serving as buffer items,
not to be scored.
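
Steps (e) and (f) amount to drawing one random ordering and applying it to both forms. The sketch below is one way to read that procedure, under the assumption that item i of Form B is the counterpart of item i of Form A and that matched words occupy the same position in both sets. The function name is hypothetical, and every word except the set quoted earlier ("Shoot I'll you ask") is invented for illustration.

```python
import random


def build_parallel_forms(form_a, form_b, seed=0):
    """Shuffle Form A and apply the identical ordering to Form B."""
    rng = random.Random(seed)
    item_order = list(range(len(form_a)))
    rng.shuffle(item_order)  # step (e): one random item order for both forms
    new_a, new_b = [], []
    for i in item_order:
        word_order = list(range(len(form_a[i])))
        rng.shuffle(word_order)  # step (f): one random word order per matched set
        new_a.append([form_a[i][j] for j in word_order])
        new_b.append([form_b[i][j] for j in word_order])
    return new_a, new_b


# The first Form A set is quoted in the text; all other words are invented.
form_a = [["Shoot", "I'll", "you", "ask"], ["books", "I", "read", "burn"]]
form_b = [["Hit", "We'll", "them", "call"], ["papers", "We", "write", "tear"]]
new_a, new_b = build_parallel_forms(form_a, form_b)
```

Fixing the seed keeps the printed forms reproducible; the original study does not say how its random orders were generated.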


Reliability and validity of Forms A and B. Reflecting the re- 
visions just described, Forms A and B were administered to classes 
in biology, history, literature, physical science, and psychology. 
Again, these classes included students from a variety of curricula 
and from all four undergraduate years. Although the test directions 
emphasized that responses should be made according to “first im- 
pressions” and as rapidly as possible, everyone was permitted to 
finish. Most students were able to complete both forms in not
more than 12 minutes.

The upper half of Table 1 shows the mean scores on Forms 
A and B, and the correlation between them when the tests were 
administered in immediate succession to the same sample of stu- 


TABLE 1

Undergraduate Students’ Mean Scores on Forms A and B of the Scrambled
Sentence Test and Correlation between Forms

Order of administration      Form A          Form B          r
                             Mean    SD      Mean    SD      between forms

Immediate succession
  Men (N = 103)
  Women (N = 118)
  Diff. in means
Six weeks interval
  Men (N = 35)
  Women (N = 58)
  Diff. in means

[Cell values illegible in the source.]

Note: Highest possible score on each form was 30.
* p < .05 (two-tailed test).
** p < .01 (two-tailed test).

464 EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 


dents; the lower half of the table shows the mean scores of each 
form and their correlation when the two forms were administered 
separately to each sample, six weeks apart. The correlation co- 
efficients seem to be reasonably good evidence of equivalence re- 
liability and stability; one may also infer construct validity from 
the fact that on each form the mean hostility score of men was 
significantly higher than that of women—a finding consistent with 
what one would expect in the expression of hostility in our culture. 

To obtain evidence of concurrent validity, the SST was ad- 
ministered to students seeking help from the Student Counseling 
Service. Eleven counseling psychologists cooperated in gathering 
these data over two consecutive semesters. Each semester, during a 
designated period of three weeks, the counselors requested the first 
ten undergraduates whom they interviewed, and whom they had 
seen at least twice previously, to take both forms of the SST; a 
total of 206 students complied. Without knowing the client’s test 
performance, each counselor also rated his hostile behavior by us- 
ing the following scale: 


Compared with other students at the 
University of Illinois, I would judge this 
client to be (Check one) : 


an extremely hostile person ____
a moderately hostile person ____
a slightly hostile person ____
a person with practically
no hostility ____


Instructions accompanying the rating forms emphasized that 
while counselors should base their ratings on observations of both 
verbal and nonverbal behavior, they should avoid making “depth” 
inferences which were remote from actually observed behavior. 

For purposes of estimating concurrent validity, judgments in the
upper two options of the scale were combined into a “hostile”
category, while those in the lower two options were combined into
a “nonhostile” category. As Table 2 shows, both men and women
judged to be “hostile” made significantly higher mean scores on
each form of the SST than did those judged to be “nonhostile.”
Furthermore, women had significantly lower mean scores than men,

TABLE 2

Undergraduate Students’ Performance on Forms A and B of the Scrambled
Sentence Test (SST) According to Counselors’ Ratings of Hostility

                        Counselors’ Ratings and SST Scores
                    Rated as “hostile”     Rated as “nonhostile”
Test and            Mean SST               Mean SST                Difference
subjects            score       SD         score       SD          in means

Form A
  Men               15.4        3.7        9.7         4.6         5.7**
  Women             12.1        4.0        7.8         3.9         4.3**
  Diff. in means    3.3**                  1.9*
Form B
  Men               13.4        4.3        9.5         4.5         3.9**
  Women             11.1        3.2        6.7         3.5         4.4**
  Diff. in means    2.3**                  2.8**
r: A versus B
  Men               .73                    .78
  Women             .70                    .74

Note: Highest possible score on each form was 30.
N’s were as follows: “Hostile” men = 46; “Nonhostile” men = 77;
“Hostile” women = 33; “Nonhostile” women = 50.
* p < .05 (two-tailed test).
** p < .01 (two-tailed test).


a finding which was consistent with that previously found for
intact classes. These results, then, not only reveal evidence for
concurrent validity, but also show additional support for construct
validity. In addition, correlation coefficients reflecting equivalence
reliability (Table 2) approached the levels previously reported for
intact classes.

Development of Form C. Phi coefficients were computed to determine
which of the hostile sentence assemblies in Forms A and
B were most highly related to counselors’ judgments of hostility.
Since it was desired to construct a cross-validated test which
would require neither a separate form for each sex nor a system
of differential scoring, only those sets of scrambled sentences were
selected for Form C which had discriminated as well for men
as for women. Following this procedure, the 30 most discriminating
items were selected; in addition, 40 buffer items from Forms A
and B were added. These 70 items were then arranged in a random
order.


² Persons wishing to use the SST for research may obtain any of the
three forms, together with their complete directions, by writing to the author
and stating the particular purpose for which they intend to use the test.
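
The phi coefficients used in selecting items for Form C relate two dichotomies: whether a student assembled the hostile sentence for a given set, and whether the counselor rated that student “hostile.” A minimal sketch of the standard phi formula for a 2 x 2 table follows. The function name and cell labels are illustrative assumptions, and no counts from the study are used.

```python
import math


def phi_coefficient(a, b, c, d):
    """Phi for a 2 x 2 table of counts.

    a: hostile sentence, rated hostile      b: hostile sentence, rated nonhostile
    c: neutral sentence, rated hostile      d: neutral sentence, rated nonhostile
    """
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denominator == 0:
        return 0.0  # a margin is empty, so the association is undefined
    return (a * d - b * c) / denominator


perfect = phi_coefficient(10, 0, 0, 10)
unrelated = phi_coefficient(5, 5, 5, 5)
reversed_relation = phi_coefficient(0, 10, 10, 0)
```

Under the selection rule described above, an item would be retained only if its phi was comparably high when computed separately for men and for women.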

Reliability and Validity of Form C. Form C was administered
to undergraduate classes in biology, history, literature and fine
arts, physical science, and psychology. The students in these classes
were enrolled in a variety of curricula, and represented all four 
undergraduate years. Although, as in the case of Forms A and B, 
the directions of Form C emphasized responding according to one's 
first impression, and working rapidly, all students were permitted 
to finish the test. Practically everyone was able to do so within 
15 minutes. 

Table 3 shows reliability coefficients reflecting internal consis- 
tency (KR 21) and stability of responses. Construct validity may 
also be inferred from the data in the table, since, as in the previous 
forms, men obtained higher mean scores than did women. 
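
The KR-21 estimate needs only the number of scored items, the mean, and the standard deviation. The sketch below applies the standard Kuder-Richardson formula 21 to the values printed for women in Table 3 (30 scored items, mean 9.5, SD 5.1); the function name is an assumption, and the source does not say whether the SD was computed with N or N - 1.

```python
def kr21(n_items, mean, sd):
    """Kuder-Richardson formula 21 from test length, mean, and SD."""
    variance = sd ** 2
    return (n_items / (n_items - 1)) * (
        1 - mean * (n_items - mean) / (n_items * variance)
    )


# Values printed for women in Table 3.
women_kr21 = kr21(n_items=30, mean=9.5, sd=5.1)
```

Rounded to two places this gives .78, which agrees with the coefficient reported for women.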

To demonstrate concurrent validity the same procedure was fol- 
lowed as with Forms A and B. Twenty-one counseling psychologists 
participated for two consecutive semesters, giving Form C to 400 
students and making independent judgments of their behavior. As 
Table 4 shows, students whose counselors judged them to be 
“hostile” made significantly higher mean scores on the SST than 
did those whose counselors judged them to be “nonhostile”; the 
correlation (biserial r) between judged hostility and SST scores 
was .65 for men and .66 for women. 


TABLE 3
Undergraduate Students’ Scores on Form C of the Scrambled Sentence Test
and Reliability Estimates

                                Men      Women    Diff. in means
Single administration
  N                             140      177
  Mean                          11.7     9.5      2.2**
  SD                            5.2      5.1
  KR-21                         .75      .78
Repeated administration:
[illegible] weeks interval
  N                             52       75
  Mean: first testing           11.9     9.8      2.1*
  SD                            5.0      4.7
  Mean: second testing          11.9     9.0      2.9**
  SD                            5.7      5.0
  r: first versus second        .67      [illegible]

Note: Highest possible score was 30.
* p < .05 (two-tailed test).
** p < .01 (two-tailed test).

TABLE 4

Undergraduate Students’ Performance on Form C of the Scrambled
Sentence Test (SST) According to Counselors’ Ratings of Hostility

                    Counselors’ ratings and SST scores
                                                                   Correlation between SST
            Rated “hostile”     Rated “nonhostile”     Diff. in    scores and ratings
Subjects    Mean SST    SD      Mean SST    SD         means       N       biserial r

Men         14.2        3.7     9.3         3.5        4.9**       220     .65
Women       11.2        3.0     6.9         3.8        4.3**       180     .66
Diff. in
means       3.0**

Note: Highest possible SST score was 30.
N’s were as follows: “Hostile” men = 84; “nonhostile” men = 136;
“hostile” women = 67; “nonhostile” women = 113.
** p < .01 (two-tailed test).


Construct validity may also be inferred from the data in Table 
4 since men had higher mean scores than women. Additional evi- 
dence for estimating construct validity was obtained by correlating 
SST scores with scores on the dominance and conflict avoidance 
scales of the Kuder Preference Record—Personal, Form A, and 
with scores on the verbal parts of the School and College Ability 
Test, Form U. As Table 5 shows, SST scores had no significant 
correlation with scores on the ability test or the dominance scale, 
but did have a significant correlation with scores on the conflict 
avoidance scale. These data are interpreted as evidence favoring 
validity, since it is reasonable to expect that a paper-and-pencil 
test which purports to measure hostility should (a) not reflect 
differences in verbal aptitude, (b) measure something different 


TABLE 5 


Correlations between Undergraduate Students’ Scores on Form C of the Scrambled
Sentence Test (SST) and Their Scores on Tests of Verbal Ability, Dominance,
and Conflict Avoidance
N of men = 83 N of women = 96 
Verbal ability Dominance Conflict, avoidance 
SST M W M W M w 
Verbal abilit -07 .03 .15 —.05 nd meh 
Dominance ^ 0  .12 Loc Qs 


Note: Verbal ability was measured with the verbal parts of the School and College Ability
Test; dominance and conflict avoidance were measured with the Kuder Preference
Record.
* p < .05 (two-tailed test).
** p < .01 (two-tailed test).

from mere “dominance,” and (c) be negatively related to the de-
sire to avoid conflict. 

Students’ understanding of the purpose of the SST. Students who 
took Form C only once (upper half of Table 3) were asked to 
state, immediately after finishing the test, what they thought it was 
measuring. Thirty-four per cent of the men and 32 per cent of the
women made correct estimates. (It will be recalled that about
70 per cent of the students recognized the purpose of the test in
its preliminary forms when buffer items were not used.) In addition,
students who had taken Form C twice (lower half of Table
3) were asked, after the second administration, to state what they
thought the purpose of the SST was. Forty per cent of the men
and 36 per cent of the women made correct estimates. However,
correlations between SST scores and estimating the purpose of the 
test (correct versus not correct) were negligible: for students who 
had taken the test only once, the biserial r's were —.09 (men) 
and .02 (women); for those who had taken the test twice, the 
biserial r’s were —.08 (men) and —.04 (women). 

Conclusions. The Scrambled Sentence Test appears to be sufficiently
promising for use in a variety of investigations where a
disguised paper-and-pencil instrument is desired. Experimental and
survey-type studies are now being planned in which its predictive 
power can be tested, and compared with that of the more usual self- 
report tests of hostility. 

ITEM DIFFICULTY SEQUENCING AND RESPONSE
STYLE: A FOLLOW-UP ANALYSIS


ALBERT D. SMOUSE AND DAVID C. MUNZ
University of Oklahoma 


Studies have shown the Achievement Anxiety Test (AAT) to
predict aptitude and achievement performance (Alpert and Haber,
1960; Dember, Nairne, and Miller, 1962; Jewell and Carrier,
1965; Milholland, 1964; Pervin, 1967), and a recent study by
the writers (Munz and Smouse, 1968) has shown that individuals
stereotyped on the basis of the AAT perform differently on academic
achievement tests depending on whether the items are sequenced
easy-to-hard (E-H), hard-to-easy (H-E), or at random
(R). In that study it appeared from the pattern of interactions
that the AAT explained more variance in a distribution
resulting from achievement test items sequenced R than when
sequenced in some other way. One of the implications suggested by
the data was that when one is attempting to assess academic
achievement, H-E sequencing should be used since it seemed to
provide least variance attributable to test-taking personality factors.
Random sequencing, on the other hand, appeared to yield
relatively more variance which could be attributed to the AAT.
If a statistical comparison showed this to be true, then it would
follow that criteria, differently sequenced, are measuring different
things, to say the least. Inasmuch as the above observations were
based on a method whereby the above stereotypes were operationally
constructed by selecting extreme scores on the AAT distribution
(see the original study for the complete method), the use
of all the data, including the mid-ranges of the AAT distribution,
would constitute a more conservative test of this notion. Also it
would make possible an extension of the initial findings to practical
situations such as the classroom, where decisions are needed
on every member of the group.

Method. The Ss were 181 beginning psychology students at the
University of Oklahoma for whom AAT+ (facilitation scale) and
AAT- (debilitation scale) scores had been obtained. Each subject
had been assigned one of three forms of the course final examina- 
tion, the forms differing only in the difficulty sequencing of the 
items (R, E-H, or H-E). Tests of the original hypothesis called 
for an analysis of variance design, but for the re-analysis, the 
two AAT scores were combined by means of multiple correlation 
to predict the final examination score. Thus, the same predictors 
were used to predict three criteria differing only as to item dif- 
ficulty sequence. Analyses of variance of the regression data from 
the respective criterion groups permitted a comparison as to the 
relative amount of variance in each criterion group that could 
be attributed to the AAT (the regression line). These comparisons 
were made by means of Ryan’s “method of adjusted significance 
levels” (1960). In addition, the criterion measures themselves were 
compared for homogeneity of variance. 

Results. Results of the multiple correlations and the analysis of 
variance for the respective group regressions are shown in Table 1. 
Also shown in Table 1 is a comparison of the criterion variances 
using Cochran’s test, which indicated that they were homogeneous
(C = .358; df = 3, 71; p > .05). Table 2 shows intergroup comparisons,
as to variance attributable to the AAT scales, using
Ryan’s method. 
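
The arithmetic behind these two comparisons can be checked from the printed values. The sketch below computes Cochran's C from the criterion standard deviations in Table 1 and the pairwise variance ratios from the regression mean squares in Table 2. The function names are assumptions, and the adjusted significance levels of Ryan's method are not reproduced here.

```python
from itertools import combinations

# Values as printed in Tables 1 and 2.
criterion_sd = {"R": 11.51, "E-H": 10.52, "H-E": 11.28}
regression_variance = {"R": 948.799, "E-H": 427.564, "H-E": 219.843}


def cochran_c(variances):
    """Cochran's C: the largest variance divided by the sum of all variances."""
    variances = list(variances)
    return max(variances) / sum(variances)


def variance_ratios(variances):
    """F ratio (larger variance over smaller) for every pair of groups."""
    ratios = {}
    for a, b in combinations(variances, 2):
        larger, smaller = sorted((variances[a], variances[b]), reverse=True)
        ratios[(a, b)] = larger / smaller
    return ratios


c_value = cochran_c(sd ** 2 for sd in criterion_sd.values())
f_values = variance_ratios(regression_variance)
```

Rounded, these give C = .358 and F ratios of 4.32 (R vs. H-E), 2.22 (R vs. E-H), and 1.94 (E-H vs. H-E), matching the F column of Table 2.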

Discussion and conclusions. It can readily be seen from Tables 
1 and 2 that the AAT explains more of the criterion variance 
when the items are sequenced R than when sequenced E-H. Simi- 
larly, more variance is accounted for by the AAT in the case of 


TABLE 1
Summary Table for Analysis of Variance for the Regression of AAT Scales
on Three Forms of the Criterion, Showing Multiple R Coefficients
and Standard Deviations for Each Criterion Group

                                               Mean Sq.         Mean Sq. For Deviations
Sequence       Criterion          Multiple     For Regression   From Regression
Group          SD          N      R            (df = 2)         (df = N − 3)              F

Random         11.51       55     .515**       948.80           101.12                    [illegible]
Easy-Hard      10.52       54     [illegible]  427.56           98.15                     [illegible]
Hard-Easy      11.28       72     .221         219.84           124.02                    [illegible]

* p < .05.
** p < .01.

TABLE 2

Multiple Comparison of Variance Explained (Attributable to Regression) by the
Combined AAT Scores Under Three Item Difficulty Sequences Using Ryan's Method

                            Samples
                      R          E-H        H-E
Degrees of Freedom    54         53         71
Variance              948.799    427.564    219.843

                  Summary of Computations
Comparison                   Adjusted
Groups           df          Alpha Level    F       p
R vs. H-E        54, 71      .0084          4.32    .05
R vs. E-H        54, 53      .0166          2.22    .05
E-H vs. H-E      53, 71      .0166          1.94    .05


E-H sequencing than with H-E. It is not surprising that there
are significant correlations inasmuch as the AAT was validated
against academic criteria. But the interesting part is that the
amount of explained variance changes systematically across the
R, E-H, and H-E forms. The possibility of these results being a
statistical artifact due to differences in variances of the respective
criterion distributions is ruled out by the fact that the criterion
variances were homogeneous.

From an applied standpoint, the above results have significance
in that they indicate that the predictive power of the AAT varies
with the item difficulty sequence of the criterion even when scores [...]

From a theoretical standpoint, the results offer evidence that item
sequencing does affect content validity. Hence, for assessment testing,
item difficulty sequencing appears to produce the noncontent-determined
variance which Cronbach (1946, 1950) says should be
eliminated or controlled.

While one may point to the superiority of the R and E-H criterion
sequences when using the AAT as the predictive instrument,
caution should be taken to avoid the generalization that the
H-E sequence is the ideal format for avoiding all test-taking
variables which might interfere with assessment purposes. For although
H-E sequencing minimizes variance attributable to the personality
variables as measured by the AAT, it is possible that such
sequencing introduces other test-taking responses, not measured by
the AAT, which may contaminate content validity to an even
greater extent than that indicated by the AAT scales. Thus, whether
the H-E sequencing reduces variance attributable to test-taking
personality factors or simply substitutes one response style for 
another is still very much open to investigation. 

Summary. Following the report of a study which suggested that 
the Achievement Anxiety Test (AAT) explains more achievement 
test variance when the criterion items are arranged randomly 
(R) than when sequenced easy-to-hard (E-H) or hard-to-easy 
(H-E), data already reported on plus additional data, eliminated 
by the original design, were re-analyzed to test the suggested 
hypothesis. Ss were 181 beginning psychology students whose AAT 
scores were used to predict their final examination scores, based 
on an exam sequenced either R, E-H, or H-E. Analyses of variance 
of the regression data from the three criterion groups permitted a 
direct comparison of the amount of variance attributed to the 
respective regression lines. The hypothesis was confirmed and in- 
terpreted in terms of response styles. 

DIFFERENTIATION OF ACADEMIC INTERESTS


DEAN L. SHAPPELL 
The Ohio State University 
FRANK C. ARNOLD 
Bowling Green University 
AND WILBUR S. GREGORY 
University of Redlands 


In guidance, the concept of motivation, as based on interest, is 
accepted as an important factor. The work of the counselor and 
the progress of the counselee can more often be facilitated if they 
have at their disposal an appropriate tool for the measurement of 
one of these interrelated areas, namely academic interests. The tool 
under consideration herein is the Gregory Academic Interest In- 
ventory (GAII). It was designed to determine a student’s academic 
interest in various disciplines at the college level. 

Problem. The study reported here was concerned with three 
aspects of academic interests: (1) interest patterns consonant with 
choice of academic major by upperclass students; (2) useable or
useful differences in patterns between or among academic majors;
and (3) useable or useful differences in patterns between sexes.

Method. The GAII was administered to 722 upperclassmen, 314
men and 408 women, at a midwestern university. Subjects were
randomly sampled from individual lists of academic majors representing
16 academic major areas in three undergraduate colleges:
Business, Education, and Liberal Arts.

Data were analyzed for 19 groups. Concern in analysis was that
of determining interest patterns which would characterize a particular
group. Individual raw scores for each of the scales were converted
to stanines and they were grouped into three
levels: high (9-8-7); medium (6-5-4); and low (3-2-1).
Differences were then tested between the expected percentages established
by Gregory and the percentages shown by the students in this
study.
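
The grouping of stanines into three levels can be sketched as follows. The expected percentages used here are the conventional normal-curve stanine percentages (4, 7, 12, 17, 20, 17, 12, 7, 4); this is an assumption for illustration, since the percentages established by Gregory's norms are not given in the text. The function names are hypothetical.

```python
# Conventional percentage of a normal distribution falling in each stanine.
STANINE_PERCENTAGES = {1: 4, 2: 7, 3: 12, 4: 17, 5: 20, 6: 17, 7: 12, 8: 7, 9: 4}


def interest_level(stanine):
    """Map a stanine to the high, medium, or low level used in the analysis."""
    if not 1 <= stanine <= 9:
        raise ValueError("stanine must be between 1 and 9")
    if stanine >= 7:
        return "high"
    if stanine >= 4:
        return "medium"
    return "low"


def expected_level_percentages():
    """Sum the conventional stanine percentages within each level."""
    totals = {"high": 0, "medium": 0, "low": 0}
    for stanine, percentage in STANINE_PERCENTAGES.items():
        totals[interest_level(stanine)] += percentage
    return totals


expected = expected_level_percentages()
```

Under this assumption about 23 per cent of a norm group would be expected at the high level, 54 per cent at the medium level, and 23 per cent at the low level; a group of majors departs from the pattern when its observed percentages differ significantly from such expectations.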

Discussion. At least a majority of each group of majors scored at
or above a stanine score of seven on the GAII describing the same
major group. In 12 of the 16 groups of majors, the highest stanine
score obtained on the GAII was most often in the test area labeled
the same as the major. The data indicate that the general interest
level of each group was near the expected mean of 5 when all areas
were combined. The mean stanine score for the total group was 4.99.

Characteristics of the 19 Groups. Table 1 represents differences 
significant at the .01 and .05 level between percentages of a group 
scoring at a high, medium, or low level on a scale and percentages so
scoring according to Gregory norms. Differences defined here are 
between observed and expected percentages so organized as to show 
patterns for a designated group on all scales of the GAII. Thus, a 
significantly larger percentage of majors in commercial arts, for in- 
stance, than of Gregory’s norm group scored at 7, 8, or 9 on the 
commercial arts and business administration scales. A significantly 
larger percentage also scored at 1, 2, or 3 on the public service en- 
gineering key. Commercial arts would then be described as a “high 
interest” level for this group and public service engineering as a 
“low interest” level. The High, Medium, and Low vertical col- 
umns at the extreme right present the number of scales, for each 
group tested, which showed such statistically significant differences. 
Rows at the bottom of Table 1 summarize data by scales rather 
than by major. 

The total group had the same number of high as low interest 
levels, with only one medium level shown. Five groups of majors 
approximated the balance between high and low interests. The 
mathematics majors had the most even pattern, while the business 
administration majors had the only "normal curve" pattern. All 
other groups had an unbalanced ratio of high to low interest levels. 

In developing the interest inventory, Gregory, as well as Strong 
and others, found that dislikes or aversions are significant parts of 
interest patterns. This is demonstrated in the scoring keys in which 
"dislike" responses often have positive weights in each scale. The 
data in Table 1 seem to indicate that low (negative) interests as 
well as high (positive) interests are significant identifying parts of 
the interest pattern of each group of majors. Certainly considerable 

TABLE 1 

Differences Significant at the 1% and 5% Levels between Observed and Expected Percentages at High, 
Medium, and Low Levels on Twenty-Eight Scales of the Gregory Inventory 

Comparing the academic interest patterns of the male and female 
groups in the study to the normative group reveals several areas of 
differentiation. At the high interest level, men more often chose ar- 
chitecture, business administration, history, mathematics, physical 
education, psychology, secondary education, sociology, and speech. 
Although these interests were not noticeably concentrated in a sig- 
nificant number of the male-dominated groups, they apparently 
were within the general interest realm of the men in the study. Such 
areas as chemistry, civil engineering, English, geology, journalism, 
languages, physics, and religion occurred at the medium interest 
level while agriculture, commercial arts, elementary education, home 
economics, mechanical engineering, and military science tended to 
be at the low interest level. 

Such matters as sampling with respect to only three colleges (Bus- 
iness, Education, and Liberal Arts) may well be involved in the 
general pattern presented. For example, the men in this sample did 
not express a clear technical and science interest. Their interest in 
these areas was, however, more pronounced than that of the women. 
Men’s interest in such areas as speech, English, journalism, and 
fine arts was not so great as the women’s. The men’s dislikes were 
represented by female-dominated areas and academic areas con- 
sidered at the low interest level by both sexes throughout the study. 

In the female group, the high interest level consisted of such areas 
as home economics, English, fine arts, languages, speech, sociology, 
psychology, secondary education, elementary education, physical 
education, history, and music. Although history and music were 
prominent interests of the women in this sample, they were not 
significant interests in many of the individual academic groups. At 
the medium interest level, religion, business administration, com- 
mercial arts, history, biological sciences, and architecture appeared 
as significant interests. A low interest level was characteristic of the 
individual group patterns of female-dominated groups. Agriculture, 
all phases of engineering, all phases of science (except biological 
sciences), and military science appeared at a low interest level. 

Women’s patterns of interests seemed perhaps better developed 
than men’s in that the women more clearly indicated patterns 
of likes and dislikes. Likewise, the women’s patterns of interests 
were more uniform than those of the men. 

The pattern for the total group probably reflects the distribution 
of interests in the sample employed in this study and is undoubtedly 
influenced by characteristics of the institution from which the 
sample was drawn. 

Summary. Major results of this study showed obvious interest 
differentiation among academic groups and between sex groups at a 
midwestern university. 

1. A very large majority of each group of majors tested had a 
significantly higher percentage of high scores on the scale 
most relevant to that major as well as secondary interests in 
academic areas most related to the major area of concentra- 
tion. 

2. Data generally demonstrated obvious interest differentiation 
among academic groups and between sex groups. Differenti- 
ation among the various language arts groups was the least 
pronounced, but with each group the highest percentage of 
responses at the high interest level was either its respective 
area or closely related areas. Twelve of 16 academic groups 
indicated that their respective areas were at the highest in- 
terest level. Sex groups were differentiated by their interest 
patterns, women revealing a tendency toward a sex-related 
pattern, regardless of academic area. 

3. High interest scales in the interest patterns of the 16 aca- 
demic groups revealed similarities and differences character- 
istic of related areas of major concentration. 

4. Low scores, indicative of dislikes or aversions, were found to 
be significant identifying parts of the pattern which dif- 
ferentiates between the various groups of majors tested. This 
finding justifies the principle that in interpreting interest 
test scores consideration should be given to the total pattern, 
including low (dislike) scores, rather than limited to an 
emphasis on high (positive liking) scores. 


The purpose of this study was to determine whether patterns of 
academic interests characterize and differentiate among upper-class 
academic majors and between sexes. 

The Gregory Academic Interest Inventory (GAII) was administered 
to 722 upperclass students enrolled in 16 academic majors at 
a midwestern university. Analysis was concerned with differences 
from Gregory’s norms for majors, men, women, and for total group. 

Major results indicated clear interest differentiation among aca- 
demic groups and between sex groups. In 12 of the 16 academic 
major groups, the highest interest level was indicated in the major 
area. Sex groups were differentiated by the GAII interest pat- 
terns, with women particularly revealing a tendency toward a sex- 
related pattern regardless of academic area. Low and medium 
scores, in addition to high scores, were found to be useful in iden- 
tifying interest patterns. 

A STUDY OF THE VALIDITY OF THE SIXTEEN 
PERSONALITY FACTOR QUESTIONNAIRE IN 
PREDICTING HIGH SCHOOL ACADEMIC 
ACHIEVEMENT¹ 


JERRY B. AYERS, W. L. BASHAW 
University of Georgia 
AND 
JAMES A. WASH 
West Georgia College 


The part that personality plays in predicting academic achieve- 
ment is a major concern in the study of adolescents. Overall high 
school achievement and achievement in science and mathematics 
are of particular interest. Cattell and Eber (1964) reported an 
equation for predicting school grades based on certain personality 
factors and Butcher and Gorsuch (IPAT, 1962) reported an equa- 
tion to predict school achievement, based on the High School Per- 
sonality Questionnaire (HSPQ). These equations do not contain 
cognitive variables. 

Crawford and Moyel (1963) studied the incremental validity of 
a personality battery added to cognitive variables. They found 
significant increments in predicting quantitative thinking and inter- 
preting literary materials, but not in predicting three other achieve- 
ment variables. They concluded that even when significant, the ad- 
ditional contributions of the personality variables were negligible. 

The major purpose of the present study was to examine the in- 
cremental validity of the Sixteen Personality Factor Questionnaire 
(16-PF) when added to certain cognitive variables in predicting 
high school achievement. Another purpose was to construct a set 
of regression equations to predict overall high school grade average 
and science and mathematics grade averages. 

¹ The researchers are grateful for the cooperation and assistance of the prin- [...] 

Procedure. The subjects for this study were 75 sophomores (37 
males and 38 females) enrolled in a second-semester, first-year 
chemistry class at Gainesville High School, Gainesville, Georgia. All 
subjects were enrolled in the 10th grade and had completed three 
semesters each of mathematics and science. 

The California Short-Form Test of Mental Maturity, Level 4, 
(CTMM) and the California Achievement Test Battery, Form W, 
(CAT) had been administered to all subjects as part of the school’s 
regular testing program. Intelligence (IQ), Reading Grade Equiva- 
lent (RGE), Mathematics Grade Equivalent (MGE), Language 
Grade Equivalent (LGE), and High School Grade Average 
(HSGA), Science Grade Average (SGA), and Mathematics Grade 
Average (MGA) were obtained from each subject’s permanent 
record. 

The Sixteen Personality Factor Questionnaire, Form A, 1957 
edition (16-PF) was used to assess personality and was administered 
by the investigators. The battery consists of sixteen subtests as 
described by Cattell and Eber (1964) each measuring a different 
personality trait (Factor). 

Results and discussion. Table 1 presents the means and standard 
deviations of all variables. The mean age of the subjects was ap- 
proximately 16 years. They were above average in intelligence and 
were achieving above grade level in reading, language, and mathe- 
matics. The class personality profile showed all scores in the normal 
range except Factor B, which reflects their above-average intelli- 
gence. 

Intercorrelations of all variables appear in Table 2. Major in- 
terest is in the validities which appear in the last three columns. 
The validities of predicting HSGA were in the range .57 to .67 for 
the cognitive variables. 

The significant personality correlates were Factors B (.38) and 
H (-.29). The significant correlation of Factor B was as expected 
since this is an intelligence measure. Factor H indicated that 
higher grades are earned by students characterized by what Cattell 
and Eber (1964) have called the “well-behaved syndrome.” 

The cognitive variables correlate with SGA in the range .56 to .66. 
The significant personality correlates of science achievement are 

TABLE 1 

Summary Statistics of Sex, Age, Cognitive Variables, Personality Factors 
and Grade Point Average (N = 76) 

Variable                              Mean      SD 
Sex                                    1.5      0.5 
Age (Months)                         191.7      [illegible] 
Cognitive Variables 
  IQ                                 114.3     11.8 
  RGE                                 12.5      1.6 
  MGE                                 12.9      1.6 
  LGE                                 12.8      1.4 
Personality Factors (Raw Scores) 
  A                                   11.4      2.9 
  B                                    7.0      1.9 
  C                                   13.6      3.7 
  E                                   12.6      3.8 
  F                                   17.6      4.2 
  G                                   12.[?]    3.5 
  H                                   12.6      5.2 
  I                                   10.0      3.1 
  L                                   10.0      2.9 
  M                                   11.9      3.0 
  N                                   10.6      2.8 
  O                                   12.2      3.9 
  Q1                                   9.9      2.6 
  Q2                                   9.3      3.6 
  Q3                                  10.0      3.1 
  Q4                                  14.5      4.3 
Grade Averages 
  HSGA                                [illegible] 
  SGA                                 [illegible] 
  MGA                                 [illegible] 



Factors A (-.24), B (.43) and Q1 (.25). As expected, Factor B 
was again significant. Factor A reveals that higher grades in sci- 
ence tend to be earned by students who are detached, critical, like 
to work alone, and tend to avoid compromise. According to Cattell 
and Eber (1964), the lowest scores on this Factor are obtained by 
research physicists. The significant correlation of Q1 and science 
achievement supports the finding with Factor A. Persons scoring 
high on Q1 can be described as critical, analytical, and experimental. 

The correlations with mathematics achievement were generally 
low. The cognitive variables correlated in the range .22 to .29. The 
[...] correlation with MGA was only .29. The highest correlate with 
MGA was Factor H (-.32). Factor H shows that the "well-be- 
haved" students who are withdrawn and conscientious tend to get 

TABLE 3 
Comparison of Multiple Correlations 

         Least Squares    ------------------ Stepwise ------------------ 
         All              All              Personality      Cognitive 
         Variables        Variables        Variables        Variables 
HSGA     .819             .818             .522             .731 
SGA      .790             .788             .547             .724 
MGA      .585             .571             .433             .337 


better grades in mathematics courses than the more adventurous 
and impulsive students. The other significant correlate was Factor 
C (-.26), which indicates that students getting better mathematics 
grades tend to be emotional, immature, and lacking in frustration 
tolerance. Findley (1968) has pointed out that mathematics is 
attractive to students who do not like to work in areas where there 
are “no right answers.” 

Multiple correlations appear in Table 3. The first column gives 
the results for the least squares criterion using all predictors. All 
other correlations are based on the stepwise regression technique. It 
is important in the case of MGA to note that personality variables, 
when used alone, resulted in a higher multiple R than the cognitive 
variables when they were used alone. This is consistent with Craw- 
ford and Moyel’s (1963) finding that personality variables aided in 
the prediction of mathematics achievement. 

F-tests for testing the incremental effect of adding personality to 
the cognitive variables showed that the added effect of the person- 
ality variables was insignificant for each criterion. The effect 
of adding cognitive variables to the personality variables was 
significant at or beyond the .05 level in each case. In the case of the 
prediction of MGA these findings seem inconsistent with the previ- 
ous finding that the personality variables correlated with MGA 
higher than the cognitive variables. This inconsistency arises from 
the degrees of freedom that are employed in the calculation of F 
and in the F-test. 
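
The role of the degrees of freedom can be illustrated with the usual F ratio for an increment in R squared. The sketch below is an illustration under stated assumptions rather than the authors' computation: it assumes N = 75, four cognitive predictors (IQ, RGE, MGE, LGE) and sixteen personality predictors, and it treats the HSGA multiple correlations printed in Table 3 as if each came from the full predictor set named (the stepwise solutions may in fact have retained fewer variables).

```python
def incremental_f(r_full, r_reduced, n, k_full, k_reduced):
    """F ratio for the gain in R^2 when (k_full - k_reduced) predictors are added.

    Returns (F, numerator df, denominator df).
    """
    df_num = k_full - k_reduced
    df_den = n - k_full - 1
    r2_full = r_full ** 2
    r2_reduced = r_reduced ** 2
    f_ratio = ((r2_full - r2_reduced) / df_num) / ((1 - r2_full) / df_den)
    return f_ratio, df_num, df_den


N = 75              # subjects reported in the Procedure paragraph
K_COGNITIVE = 4     # assumption: IQ, RGE, MGE, LGE
K_PERSONALITY = 16  # assumption: all sixteen 16-PF factors
K_ALL = K_COGNITIVE + K_PERSONALITY

# HSGA multiple correlations as printed in Table 3.
f_add_personality = incremental_f(0.819, 0.731, N, K_ALL, K_COGNITIVE)
f_add_cognitive = incremental_f(0.819, 0.522, N, K_ALL, K_PERSONALITY)

print("personality added to cognitive:", f_add_personality)
print("cognitive added to personality:", f_add_cognitive)
```

Under these assumptions the gain from the personality set is spread over sixteen numerator degrees of freedom, while the gain from the cognitive set is spread over only four, so a similar or even smaller gain in R squared can yield a much larger F for the cognitive variables.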

The results of this study both support and fail to support the 
validity of the 16-PF in academic prediction. In the opinion of the 
investigators the correlation patterns of the 16-PF for predicting 
science and mathematics achievement fairly accurately describe 
the difference between good and bad science and mathematics stu- 
dents. These correlations support the interpretations given by the 
16-PF authors. 

* A copy of the regression equations for predicting HSGA, SGA, and MGA 
may be obtained from Dr. Jerry B. Ayers, 120 Fain Hall, University of Geor- 
gia, Athens, Georgia. 

On the other hand there is little evidence that the 16-PF battery 
will improve prediction systems involving cognitive predictors. How- 
ever, with large samples, this conclusion could be contradicted (e.g., 
Crawford and Moyel, 1963). Also there is reason to believe that 
some of the students were too young for this version of the 16-PF, 
since it is recommended for subjects 16 years of age and older. 


REFERENCES 


Cattell, R. B. and Eber, H. W. Handbook for the Sixteen Personality 
Factor Questionnaire. Champaign, Ill.: Institute for Personality 
and Ability Testing, 1957, with 1964 supplementation. 

Crawford, W. R. and Moyel, I. S. Predicting Academic Achievement 
from Intelligence and Personality Data. Florida Journal of Ed- 
ucational Research, 1963, 5, 19-27. 

Findley, W. G. Personal communication. July, 1968. 


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
1969, 29, 485-488. 


THE MAILBAG LITERACY INDEX IN A CLINICAL 
POPULATION: RELATION TO EDUCATION, INCOME, 
OCCUPATION, AND SOCIAL CLASS¹ 


DONALD K. ROUTH AND KATHRYN RETTIG 
University of Iowa 


Plog (1966) described a procedure for estimating the educational 
level of writers of public mail by rating four characteristics of their 
letters: quality of the paper used, neatness and spacing, grammar 
and word usage, and “graphic maturity.” This Literacy Index for 
the Mailbag was validated on the letters of 133 persons who wrote 
the editor of the Boston Herald with respect to a particular public 
issue. The Literacy Index was said to be convenient, reliable, and 
capable of reasonably accurate prediction of the writer's educa- 
tional level (grammar school, high school, or college). The present 
study was an attempt to generalize the use of the Literacy Index to 
a different population, the parents of children attending an out- 
patient pediatric clinic, and to measure its relationship to the vari- 
ables of income, occupation, and social class in addition to the edu- 
cational level of the writers. 
Method. One hundred four letters were taken from the case rec- 
ords of children seen in the Child Development Clinic during the 
years 1966 and 1967 (52 letters for each year), the procedure being 
to extract the first letter written by a parent or guardian from 
each case folder which contained any letters. Each letter was then 
independently rated by the two investigators. Final ratings were 
reached by resolving differences, if any, by discussion. The educa- 
tional and income levels of the writers of the letters (mean educa- 
tional level, 12.4 years; standard deviation, 2.21) were obtained 
from questionnaires routinely filled out and available in the case 
files. The occupation and social class measures used were based on 
Hollingshead’s (1957) Two Factor Index of Social Position. The 
Hollingshead social class index is merely a weighted sum of occupa- 
tion and education and not an independent measure. Multiple 
regression analysis was carried out by the Doolittle method (Guil- 
ford, 1956) on an IBM 7044 computer. 
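
Because the social class index is a weighted sum of occupation and education, it can be computed in a few lines. The sketch below is illustrative only: the weights of 7 for occupation and 4 for education and the seven-point scales are those generally cited for Hollingshead's index and are not stated in the present text, the class ranges follow the study's footnote on the index, and the function names are hypothetical.

```python
# Assumption: factor weights as generally cited for Hollingshead's Two Factor
# Index (occupation 7, education 4); the article itself does not state them.
OCCUPATION_WEIGHT = 7
EDUCATION_WEIGHT = 4

# Upper bound of each class range (11-17, 18-27, 28-43, 44-60, 61-77).
CLASS_UPPER_BOUNDS = [(17, "I"), (27, "II"), (43, "III"), (60, "IV"), (77, "V")]


def social_position(occupation, education):
    """Weighted sum of the occupation and education scale values (each 1-7)."""
    for label, value in (("occupation", occupation), ("education", education)):
        if not 1 <= value <= 7:
            raise ValueError(f"{label} scale value must be between 1 and 7")
    return OCCUPATION_WEIGHT * occupation + EDUCATION_WEIGHT * education


def social_class(score):
    """Translate an Index of Social Position score (11-77) into Class I-V."""
    if not 11 <= score <= 77:
        raise ValueError("score must be between 11 and 77")
    for upper_bound, class_name in CLASS_UPPER_BOUNDS:
        if score <= upper_bound:
            return class_name


# Hypothetical family head: occupation scale 4, education scale 3.
score = social_position(4, 3)
print(score, social_class(score))
```

Since education enters the index directly, a correlation between the Literacy Index and social class cannot be read as independent of its correlation with education, which is the caution the Method paragraph raises.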

Results. Plog (1966) reports inter-rater per cent agreements on 
the four subscales ranging from 85 to 96, with a mean per cent 
agreement of 93. In the present study the corresponding figures 
were somewhat lower, but showed a tendency to improve with ad- 
ditional experience and discussion between the raters. From the 
first to the second 52 letters, agreement on quality of paper im- 
proved from 79 per cent to 92 per cent, agreement on grammar and 
word usage from 42 to 77 per cent, agreement on “graphic maturity" 
from 67 to 79 per cent, while agreement on neatness and spacing 
fell from 60 per cent to 50 per cent. The Pearson product-moment 
correlation coefficients between raters for the present study for 
all 104 letters were .79 for quality of paper, .51 for neatness and 
spacing, .40 for grammar and word usage, .35 for “graphic ma- 
turity,” and .66 for the composite rating using the regression 
weights derived by Plog. It will be recalled that the final ratings 
used for each letter in the present study were based on the combined 
ratings of both investigators. No doubt, these ratings had a somewhat 
higher reliability than those indicated above. 
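
The two reliability figures reported above, per cent agreement and the product-moment correlation between raters, measure different things and can diverge. A minimal sketch of both follows; the ratings are hypothetical values on an assumed three-point scale, not data from the study.

```python
from math import sqrt


def percent_agreement(ratings_a, ratings_b):
    """Percentage of letters on which the two raters gave the identical rating."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("rating lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100 * matches / len(ratings_a)


def pearson_r(ratings_a, ratings_b):
    """Pearson product-moment correlation between two lists of ratings."""
    n = len(ratings_a)
    mean_a = sum(ratings_a) / n
    mean_b = sum(ratings_b) / n
    cross = sum((a - mean_a) * (b - mean_b) for a, b in zip(ratings_a, ratings_b))
    ss_a = sum((a - mean_a) ** 2 for a in ratings_a)
    ss_b = sum((b - mean_b) ** 2 for b in ratings_b)
    return cross / sqrt(ss_a * ss_b)


# Hypothetical ratings of eight letters by two raters on a 1-3 scale.
rater_1 = [1, 2, 2, 3, 3, 1, 2, 3]
rater_2 = [1, 2, 3, 3, 2, 1, 2, 3]

agreement = percent_agreement(rater_1, rater_2)
correlation = pearson_r(rater_1, rater_2)
print(agreement, round(correlation, 3))
```

Per cent agreement counts only exact matches, whereas the correlation also reflects how far apart the disagreeing ratings are and how much the ratings vary, which is one reason a subscale can show moderate agreement and a low correlation.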

Intercorrelations among the four sub-scales were all positive and 
significant (p < .05), ranging from .16 to .37. Again, these were somewhat 
lower than the intercorrelations reported by Plog (.35 to .52). 

With respect to prediction of criterion variables, the present 
study reports the results on only 95 cases, the other nine being 
dropped from the calculations because of missing data of one sort 
or another. In the present sample Plog’s composite score predicted 
educational level significantly, as indicated by a correlation coef- 
ficient of .50 (p < .01); this correlation is lower than the figure of 
.71 he reports but within limits which might be expected for shrink- 
age on cross-validation. 

Table 1 shows multiple linear regression weights and correspond- 
ing multiple correlation coefficients for the prediction of the writer's 
educational level (in years), the occupational level of the head of 

TABLE 1 

Multiple Regression Weights and Multiple Correlations 
between Literacy Index Variables and Predicted Criteria 

                               Multiple Regression Weights 
                  Constant   Quality    Neatness &   Grammar &    Graphic     Multiple 
Criterion         Weight     of Paper   Spacing      Word Usage   Maturity    R 
Education           5.50       .90        -.35          .61          .78        .51 
Occupation          8.99      -.39        -.16         -.58        -1.15        .51 
Income              8.13      -.98         .35         -.44        -1.10        .54 
Social class       99.75     [remaining entries illegible] 



household,² family income,³ and the social position of the family 
(Hollingshead, 1957).⁴ It is evident from Table 1 that the predic- 
tion of the writer’s educational level was hardly improved in the 
present sample by deriving new, optimal regression weights; that 
is, the multiple R of .51 is about the same as the correlation coef- 
ficient of .50 found with regression weights derived on Plog’s sam- 
ple. It is also seen in Table 1 that the so-called Literacy Index pre- 
dicts occupational level, family income, and social class as well as, 
if not better than, it predicts education; all of the multiple correla- 
tion coefficients approximate .5, which is significantly different from 
zero (p < .05). It should be noted that in the present sample, as is 
the usual finding, education, occupation, and income were interre- 
lated, the intercorrelations ranging from .33 to .65 (p < .05 in 
each case). 

Discussion. The present study provides a replication or cross- 
validation of Plog’s essential results in that a significant relationship 
was found between the rating scales for letters and the educational 
level of the correspondents. These essential findings held up de- 
spite the use of a different sample of letter writers. The corre- 
spondents of the present sample, for example, were not engaged in 


² Occupation was coded as follows: 0, executives and major professions; 1, 
managers and lesser professions; 2, administrative, minor professions; 3, clerical, 
technicians; 4, skilled manual; 5, semi-skilled manual; and 6, unskilled. 
³ Family income was coded as follows: 0, over $10,000 per year; 1, $8,000-$9,999; 
2, $7,000-$7,999; 3, $6,000-$6,999; 4, $5,000-$5,999; 5, $3,000-$4,999; [...] 
⁴ In the Hollingshead Index of Social Position, the numbers 11-17 corre- 
spond to Class I, 18-27 to Class II, 28-43 to Class III, 44-60 to Class IV, and 
61-77 to Class V, the lowest social class. 
writing letters “spontaneously” to “a person of importance” (Plog’s 
criteria for “public” letter writing) but were most often asking for 
an appointment or trying to get an appointment time changed. 

Secondly, the present results suggest that Plog’s “Literacy Index” 
might be more accurately described as a measure of social position 
than one of educational level alone. 

As to the practicality of the “Literacy Index,” it was indeed 
found to be easy to use. However, the relatively low inter-rater 
agreements obtained even after considerable study and discussion 
of the published rating criteria suggest the need for more explicit 
definitions. Finally, the modest degree of relationship found be- 
tween the Index and the various criteria predicted might cause 
some to hesitate to use it in a practical situation. 


REFERENCES 


Guilford, J. P. Fundamental Statistics in Psychology and Educa- 
tion. New York: McGraw-Hill, 1956. 

Hollingshead, A. B. Two Factor Index of Social Position. New Ha- 
ven, Conn.: The Author, 1957. 

Plog, S. C. A Literacy Index for the Mailbag. Journal of Applied 
Psychology, 1966, 50, 86-91. 


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
1969, 29, 489—493. 


COMPARATIVE VALIDITIES OF THE STRONG 
VOCATIONAL INTEREST BLANK ACADEMIC 
ACHIEVEMENT SCALE AND THE COLLEGE STUDENT 
QUESTIONNAIRE MOTIVATION FOR GRADES SCALE 


CARL A. LINDSAY AND RICHARD ALTHOUSE 
The Pennsylvania State University 


In the context of improving academic prediction there has been 
a recent interest in the predictive validity of two scales of academic 
motivation: the Strong Vocational Interest Blank (SVIB) Academic 
Achievement Scale (AACH) (Campbell, 1966; Campbell and Jo- 
hansson, 1966), and the College Student Questionnaire (CSQ-1) 
Motivation for Grades (MG) Scale (Myers, 1965; Peterson, 1965; 
Furst, 1966). The two scales stem from different approaches toward 
the measurement of academic motivation, but each, being strongly 
empirically oriented, is moderately correlated with achievement. 

No studies to date have examined the comparative validities of 
these two scales. Therefore, the present study was undertaken to 
(a) compare the simple and incremental validities for college 
freshmen, and (b) form some tentative hypotheses about the dif- 
ferential validities of the two academic motivation scales. 

A discussion of the content, rationale, and statistical properties of 
the scales of interest may be found in Lindsay and Althouse 
(1968). 

Method-Subjects. The SVIB was administered to all incoming 
Pennsylvania State University freshmen during the summer of 1966 
as part of a pre-registration testing and counseling program. Dur- 
ing the 1966 Fall Term Orientation Program at the University Park 
campus, the CSQ-1 was administered to a random sample of 299 
male and 89 female freshmen. Subjects (Ss) for the present study 
were the sample of 388 University Park freshmen who had taken 
both the SVIB and the CSQ-1. 

Description of variables. AACH and MG scores were obtained 
from the SVIB and CSQ-1, respectively. Two indices of academic 
aptitude, Scholastic Aptitude Test (SAT) Verbal and Math scores, 
were obtained from the Ss' admissions records. A measure of past 
performance, high school average (HSA), which is the student's 
mean achievement during the last three years in high school, was 
also obtained. The criterion variable was the cumulative grade 
point average (CGPA) at the end of the freshman year. Both HSA 
and CGPA are on a four-point scale where A = 4.00, B = 3.00, 
etc. 

Analyses. Two separate correlational analyses were run, one for 
males and one for females. To examine the simple validities of the MG and 
AACH scales, descriptive statistics and zero order correlations were 
developed using MG and AACH, a common set of predictors (SAT 
Verbal, SAT Math, and HSA), and CGPA. To provide a basis for 
comparing the incremental validities of MG and AACH, a three 
variable multiple R (SAT Verbal, SAT Math, HSA) for predicting 
freshmen CGPA was developed. The multiple R’s with (a) MG 
and (b) AACH added to the three variable R’s were then com- 
pared to the original R’s. 
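
The three variable multiple R that serves as the baseline can be obtained directly from zero-order correlations by solving the normal equations for the standardized weights. The sketch below is an illustration, not the authors' program: the correlations are the male values as they can be read from the printed Table 1 (an assumption wherever the print is unclear), and the helper names are hypothetical.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def multiple_r(predictor_corr, validities):
    """Multiple R and standardized weights from predictor intercorrelations and validities."""
    betas = solve(predictor_corr, validities)
    r_squared = sum(beta * r for beta, r in zip(betas, validities))
    return r_squared ** 0.5, betas


# Male correlations as read from Table 1 (assumed legible values):
# SAT Verbal, SAT Math, HSA intercorrelations, and each predictor with CGPA.
predictor_corr = [
    [1.00, 0.37, 0.15],
    [0.37, 1.00, 0.17],
    [0.15, 0.17, 1.00],
]
validities = [0.19, 0.19, 0.38]

baseline_r, betas = multiple_r(predictor_corr, validities)
print(round(baseline_r, 3), [round(b, 3) for b in betas])
```

With these rounded inputs the baseline comes out near .41, in line with the three variable R reported for males. The same function can be reused with a fourth row and column for MG or AACH; the difference between the two R values is the increment compared in the study.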

Results. The results of the simple validity analysis are presented 
in Table 1. The intercorrelation between the MG and AACH scales 
is .16 for both sexes suggesting that each scale measures essentially 
different motivational factors. The simple validity coefficients of the 
MG scale (r = .27 for males; .29 for females) are higher than 
those of the AACH scale (r = .10 for males; .25 for females) with 
CGPA. In fact, the correlation of AACH with CGPA is not sig- 


TABLE 1 
Means, Standard Deviations and Intercorrelations of Predictor and Criterion Variables 

                   Males (N = 226)     Females (N = 88) 
Variable           Mean      SD        Mean      SD          1     2     3     4     5     6 
1. SAT Verbal     536.41    78.40     570.40    72.16        -    [?]   [?]   [?]   [?]   [?] 
2. SAT Math       602.19    75.92     574.42    64.92       37     -    [?]   [?]   [?]   [?] 
3. HSA              3.08      .49       3.85      .46       15    17     -    [?]   [?]   [?] 
4. MG              24.28     4.08      26.76     4.73      -14    12    40     -     16    29 
5. AACH            45.55    12.03      44.67     9.30       23    26    31    16     -     25 
6. CGPA             2.52      .55       2.86      .48       19    19    38    27    10     - 

Note: Correlations for males are below the diagonal; for females, above. Decimal points have been omitted. 
For males, r.05 = .138, r.01 = .180. For females, r.05 = .210, r.01 = .283. 

TABLE 2 
Comparison of Incremental Validities of the MG and AACH Scales 

              Males (N = 288)                              Females (N = 89) 
Variables(a)  Multiple R   Increment   F         Variables   Multiple R   Increment 
1,2,3           .4132         -         -        1,2,3         .5177         - 
1,2,3,4         .4489       .0357     10.9*      1,2,3,4       .5504       .0327 
1,2,3,5         .4172       .0040      0.6       1,2,3,5       .5289       .0112 

(a) Variables: 1 = SAT Verbal, 2 = SAT Math, 3 = HSA, 4 = MG, 5 = AACH. 
* p < .05. 
** p < .01. 



nificantly greater than zero (.05 level) for males. The male validity 
coefficient for the AACH scale is considerably lower than the 
coefficient reported by Campbell (1966), who found a zero-order 
correlation of .36 between AACH and CGPA in a cross-validation 
sample of 250 college males. 
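
Whether a validity coefficient differs from zero depends on sample size as well as on its magnitude, which is why .10 fails the test here while Campbell's .36 would pass easily. A minimal sketch of the usual t test for a correlation follows; the male N of 226 is the figure printed in Table 1, and the critical value of about 1.97 (two-tailed, .05 level, roughly 200 or more degrees of freedom) is an approximation supplied for illustration.

```python
from math import sqrt

# Approximate two-tailed .05 critical t for roughly 200 or more degrees of
# freedom (an assumption used here in place of an exact table lookup).
T_CRITICAL_05 = 1.97


def t_for_r(r, n):
    """t statistic for testing a correlation against zero, with n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)


def critical_r(t_critical, n):
    """Smallest absolute correlation that reaches the given critical t for sample size n."""
    degrees_of_freedom = n - 2
    return t_critical / sqrt(degrees_of_freedom + t_critical ** 2)


t_present = t_for_r(0.10, 226)   # AACH with CGPA, males, present study
t_campbell = t_for_r(0.36, 250)  # Campbell's cross-validation sample

print(round(t_present, 2), t_present > T_CRITICAL_05)
print(round(t_campbell, 2), t_campbell > T_CRITICAL_05)
print(round(critical_r(T_CRITICAL_05, 226), 3))
```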

The correlations between MG and SAT scores do not differ sig- 
nificantly from zero for either males or females, but three of the 
four correlations of the AACH scale with SAT scores are signifi- 
cantly greater than zero (.05 level) for both males and females. 
This finding suggests a higher order relationship of AACH items 
than of MG items with measures of aptitude. Both scales are also 
moderately correlated with past performance (HSA). 

Shown in Table 2 are the incremental validity data. Not unex- 
pectedly, females (R = .52) are more predictable than males (R = 
.41) when the common set of predictors (HSA, SAT Math, and 
Verbal scores) are combined to develop a three variable multiple 
R with CGPA as the criterion. From an incremental validity point 
of view, the MG scale is superior to the AACH. Adding the MG 
scale to the predictors produced a slight increment for both males 
and females (.035), whereas the increment from adding the AACH scale 
to the predictors is negligible (males = .004; females = .011). 
This lack of AACH incremental validity agrees with Campbell's 
(1966) study. 

Discussion. The present study supports Campbell (1966) in that 
whatever variance in achievement is related to AACH scores is al- 
most equally well measured by aptitude and past performance. 
The same can be said for the MG scores although no published 
data are available for comparison. The MG scale made a better 
showing in the comparisons made than did the AACH scale, with 
slightly higher validity coefficients and multiple R increments. 
However, correlations of .27 (males) and .29 (females) and a mul- 
tiple R increment of .04 are too low to be of much practical use. 

A limitation of the findings of the present study relates to the 
characteristics of the Ss employed. Comparatively speaking, they 
are rather select (mean SAT total score is around 1150, and HSA is 
over 3.00) and homogeneous. 
The GPA standard deviation of Campbell’s (1965) Ss was about 
.85 whereas in the present study it was about .50. A case could be 
made to the effect that Ss used in this study are not representative 
of college freshmen in general and hence a fair comparison of the 
validities of the two scales was not made. Further research is in- 
dicated using a more heterogeneous group of students. 
Subject to the limitations just raised, the following conclusions are 
offered. 

1. From a practical, e.g., admissions or counseling, point of view, 
both the MG and AACH scales appear to be of limited utility in 
predicting first year college achievement. Neither the simple nor 
incremental validities of the scales is high enough to warrant their 
use in applied settings. 

2. In agreement with Furst (1966) it appears that the MG scale 
is a somewhat better predictor of achievement than is the AACH 
scale because MG items are a special instance of a more general 
principle, i.e., better prediction results when the elements in the 
predictor represent as directly as possible the critical elements in 
the criterion. In this regard, the present study suggests that first 
year achievement in college is more directly related to past indices 
of grade-getting behavior than to interest patterns associated with 
good scholarship. This conclusion is not intended to detract from 
Campbell’s (1965) finding that there is a substantial relationship 
between scores on the AACH scale and eventual educational level. 


REFERENCES
Campbell, D. P. Manual for the Strong Vocational Interest Blank
for Men and Women. California: Stanford University Press

Campbell, D. P. The Results of Counseling: Twenty-five Years La-
ter. Philadelphia: Saunders, 1965.

Campbell, D. P. and Johansson, C. B. Academic Interests, Scholas-
tic Achievement, and Eventual Occupations. Journal of Coun-
seling Psychology, 1966, 13, 416-424.



Furst, E. J. Validity of Some Objective Scales of Motivation for 
Predicting Academic Achievement. EDUCATIONAL AND PSYCHO- 
LOGICAL MEASUREMENT, 1966, 26, 927-933. 

Lindsay, C. A. and Althouse, R. Comparative Validities of the 
Strong Vocational Interest Blank Academic Achievement Scale 
and the College Student Questionnaire Motivation for Grades 
Scale. Student Affairs Research Report 68-2. The Pennsylvania 
State University, University Park, Pa., June, 1968, (mimeo). 

Myers, A. E. Risk Taking and Academic Success and Their Relation 
to an Objective Measure of Achievement Motivation. EDUCA- 
TIONAL AND PSYCHOLOGICAL MEASUREMENT, 1965, 25, 355-363. 

Peterson, R. E. Technical Manual—College Student Questionnaire. 
Princeton, N. J.: Educational Testing Service, 1965. 


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT
1969, 29, 495-502. 


THE VALIDITY OF THE HOWARD MAZE TEST AS A 
MEASURE OF STIMULUS-SEEKING IN PRESCHOOL 
CHILDREN 


KAREN VROEGH AND MILLICENT HANDRICH1
Institute for Juvenile Research 


Howard (1961) developed the Maze Test to measure the per-
sonality variable of stimulus-seeking as discussed by White (1959)
and Fiske and Maddi (1961). A paper and pencil maze is presented
several times to an S to complete by drawing a line on any of a
number of clear paths from the start to the goal. There are no
blind alleys and all paths are equally “correct.” The S has the
option to vary his routes through the maze on each trial or to re-
peat his initial route totally or in part. The amount of change in
paths to the goal across several presentations of the maze indi-
cates the amount of stimulus-seeking behavior. Thus, the presenta-
tion of several mazes constitutes a repetitive task that must be
structured by S according to his customary way of seeking inter-
action in his environment. It is assumed that his response in the test
situation mirrors his response in other situations.
Evidence of the validity of this measure of stimulus-seeking be-
havior in adults has been reported. Domino (1965) found a cor-
relation of .42 (p < .01) between a rough index of college Ss' actual
stimulus-seeking behavior (number of activities attended on cam-
pus) and performance on the Maze Test. Howard (1961), Howard
and Diesenhaus (1965b), and Sidle, Acker, and McReynolds (1963)
found that psychiatric patients did not show as much stimulus-
seeking behavior on the Maze Test as general medical patients, col-
lege students, and employment counselors.


1 The authors wish to acknowledge the contributions of Hynda Gamze,
Director, North Shore Congregation Israel Nursery School, Glencoe,
Illinois, to the initiation and continuation of this study.
Purpose. The present study was designed to obtain evidence of the
validity of a downward extension of the Howard Maze Test for
measuring stimulus-seeking in preschool children. Two types of va-
lidity information were sought, one construct and the other con-
current. To evaluate the construct validity of the Maze Test, the
stimulus-seeking performance of children from two environmental
backgrounds, one advantaged and the other disadvantaged, was
compared. It was predicted that the advantaged children would
show significantly more stimulus-seeking behavior on the Maze
Test.

Piaget has said that the more new things a child sees and hears,
the more he is interested in seeing and hearing. Middle- and upper-
class children are less likely to be deprived of situations that pro-
mote stimulus-seeking behavior than are lower-class children. Cur-
iosity behavior is encouraged. Lower-class family life, on the other
hand, is generally more authoritarian in nature. Parents in this
environment often inhibit a child's movements rather than en-
courage them. The child is deprived of a variety of stimuli and
learns to expect and be satisfied with the routine and the monot-
onous.

To assess the concurrent validity of the Maze Test, teachers'
ratings of children's stimulus-seeking behavior were correlated with
maze scores. Positive correlations were expected.

Subjects. Thirty-one disadvantaged Negro children (16 boys; 15 
girls) from a housing project preschool and 42 advantaged white 
children (19 boys; 23 girls) from two private preschools took part 
in the study. All of the children were between the ages of four years 
and five years, two months. 

Procedure. Each child was individually administered first the
Howard Maze Test box maze adapted for children (Fig. 1) as a
measure of stimulus-seeking (Howard, 1961) and then the Good-
enough-Harris Draw-A-Man (D-A-M) test as a measure of intel-
lectual ability. The D-A-M was administered because there was
every reason to believe that there would be differences in measured
intelligence of the advantaged and disadvantaged children, and be-
cause evidence of the relationship between stimulus-seeking as meas-
ured by the Maze Test and intelligence is not entirely consistent.

Four copies of the maze were presented to each child with these 
initial instructions: 


Figure 1. An adaptation of the Howard Maze Test box maze. 
“You are going to play a game. The bunny here (point) is
hungry and wants to get to these carrots. I want you to draw a
line to show how the bunny gets to the carrots. Stay on the paths.
Do not cross any lines. Now put your pencil here and draw a line
from the bunny to the carrots, taking any path you want.”


For subsequent maze presentations, the child was urged to take any
path he wanted. The mazes were scored according to the conven-
tions presented by Howard and Diesenhaus (1965a).

The eight stimulus-seeking items presented in Table 1 were used
to obtain ratings of the advantaged children's stimulus-seeking be-
havior in the classroom. Two teachers did the ratings for each child.


Results. The differences between the intellectual ability scores of
the advantaged and disadvantaged boys and girls shown in Table 2
were analyzed by a two-way analysis of variance. It was found that
the advantaged children had higher D-A-M scores than the dis-
advantaged children (F = 6.02, df = 1,69, p < .05) and that
girls had higher D-A-M scores than boys (F = 8.58, df = 1,69,
p < .05). There was no significant interaction.

In order to determine the relationship of D-A-M scores and
maze scores, product-moment correlations were computed for the
advantaged and disadvantaged children separately. The two vari-
ables were found to be independent for both groups, advantaged
(r = .05) and disadvantaged (r = .18).


TABLE 1
Items Used to Obtain Teacher-Ratings of Stimulus-Seeking


Items

1. This child has no ideas about what to do with himself (herself) until the
   teacher makes a suggestion.
2. Although this child is capable of using play materials in ways that most
   children do, he (she) completes the most ordinary of tasks in original ways.
3. This child eagerly explores his (her) environment, seeking situations
   and things that are new and different to him (her).
4. This child likes routine. He (She) approaches an activity in a patterned
   manner.
5. This child maintains a high level of activity, even though there is no
   specific task for him (her) to do.
6. This child avoids unfamiliar activities and objects.
7. When a task is presented to this child, he (she) completes it in the easiest
   possible way. No more than what is asked is put into the task.
8. This child plays with a variety of activities during free play rather
   than one or two activities, but does not flit from activity to activity.

TABLE 2


Summary of Significances of Differences between Scores of Intellectual Ability
and Stimulus-Seeking of Advantaged and Disadvantaged Boys and Girls


                             Advantaged      Disadvantaged         Total
Variable                    Boys    Girls    Boys    Girls     Adv.     Disadv.
Number of Ss                 19      23       16      15        42        31

Intellectual Ability:
  standard D-A-M scores     80.5    90.2*    72.7    82.1*     85.8*     77.2
Stimulus-Seeking:
  Maze Test scores          4.84    3.33     2.93    1.21      4.01**    2.10


* Significantly larger with p < .05.
** Significantly larger with p < .01.
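
The Total columns of Table 2 can be checked against the subgroup entries, since each total should be the size-weighted mean of the boys' and girls' means. The following is only a minimal sketch of that arithmetic; the dictionary layout and the function names are illustrative choices, not part of the original analysis.

```python
# Subgroup sizes and means as printed in Table 2.
# Each cell is (n, mean standard D-A-M score, mean Maze Test score).
table2 = {
    "advantaged": {"boys": (19, 80.5, 4.84), "girls": (23, 90.2, 3.33)},
    "disadvantaged": {"boys": (16, 72.7, 2.93), "girls": (15, 82.1, 1.21)},
}


def pooled_mean(cells, index):
    """Size-weighted mean of one score column across the sex subgroups."""
    total_n = sum(cell[0] for cell in cells)
    return sum(cell[0] * cell[index] for cell in cells) / total_n


def group_summary(group):
    """Return (n, pooled D-A-M mean, pooled Maze mean) for one group."""
    cells = list(table2[group].values())
    n = sum(cell[0] for cell in cells)
    return n, pooled_mean(cells, 1), pooled_mean(cells, 2)


for group in table2:
    n, dam, maze = group_summary(group)
    print(f"{group}: n={n}, D-A-M={dam:.2f}, Maze={maze:.2f}")
```

Under these assumptions the pooled values agree with the printed totals to rounding. The same group sizes also account for the reported degrees of freedom: 73 children in four cells leave 69 error degrees of freedom for the two-way analysis, and two groups leave 71 for the t test.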


Construct Validity. Since intellectual ability was not found to be
related to stimulus-seeking performance, the significance of the dif-
ference between the maze scores (See Table 2) of the two groups
was determined by a t test. As hypothesized, advantaged children
had higher stimulus-seeking scores than disadvantaged children (t
= 3.08, df = 71, p < .01).

Concurrent Validity. Before testing for the concurrent validity of
the Maze Test, the reliabilities of and the interrelationships among
the eight stimulus-seeking items and the sum of the items were
determined. The ratings of the advantaged children by two teachers
in each school were correlated. In School 1, all of the items in-
dividually and the sum of the items were reliable with p < .01. The
item coefficients ranged between .88 and .96 and the sum coefficient
was .98. For School 2, the reliability coefficients for the sum of the
items and for all items individually except 4, 5, and 8, were sig-
nificant with p < .05. The significant item coefficients in School
2 ranged between .44 and .73 and the sum coefficient was .61.

To determine the interrelationship of the eight items and the sum
of the items, the means of two teachers' ratings were intercorrelated
for each school. From the results in Table 3, it can be seen that for
both schools, all of the items except Item 4 were significantly cor-
related with the sum of the items (p < .05). For both schools, Item
3 in particular was found to relate to the sum stimulus-seeking
rating. For School 1, correlations among all the items except those
involving Item 4 were significant (p < .05). For School 2, only
some of the items were found to be interrelated.

Judged to be reliable measures of stimulus-seeking behavior, to be 
TABLE 3


Intercorrelations of Ratings on Eight Items of Stimulus-Seeking Behavior, 
the Sum of the Eight Items (Total), and Howard Maze Test Scores 
in Two Schools of Advantaged Children 


Item 


Item 2 74* 01 .68* .52* 73* 7* — .83* 3 
48* —.07 .34  .25* 74* 21 66° 2 

Item 3 14 .63* .80* — .80* 74*  .9i* 10 
04 .85* .87*  .47* 42*  .99* M 

Item 4 24 .18 00 —.06  .25 .08 
1 .0 —.20 —.28 .04 —.35 

Item 5 61%. 60%. .68* .80f .00 
74* 23 .85 84* nu 

Item 6 SUA 05 85 —.02
37 .85 82* 15 

Item 7 .65*  .84* 27 
40  .65* .80 

Item 8 .81* .06 
52* .58* 

Total .18 
Ei 
The upper correlations are based on the ratings from School 1 and the lower correlations
on the ratings from School 2.
* p < .05.


related measures, and to have at least face validity, the mean rat- 
ings of the individual items and the sum of the items were corre- 
lated with the maze scores by school. It can be seen in Table 3 that 
generally none of the item ratings nor the sum ratings were signif- 
icantly related to maze scores in either school. 

Discussion. In this study, evidence for the construct validity of
the Maze Test was presented. The data collected supported the
hypothesis that advantaged children engage in more stimulus-seek-
ing on the Maze Test than do disadvantaged children. However, a
test of the concurrent validity of the Maze Test was not significant.
Maze scores did not correlate with teacher ratings of children's
stimulus-seeking behavior in the classroom.

It was suggested to the investigators that maze scores might only
provide rank order data. To determine whether an analysis of the
data appropriate for rank order data would provide a significant
relationship between maze scores and behavioral ratings of stimulus-
seeking, the data were grouped into high, medium, and low maze
scores and into high and low stimulus-seeking ratings. Analysis of
these scores by chi square still indicated no relationship between the two
measures of stimulus-seeking.

Evidence presented for the validity and reliability of the items
used for rating stimulus-seeking would tend to indicate that the
items provided an adequate criterion for assessing the validity of
the Howard Maze Test. Item 3 (See Table 1) had the strongest
face validity, correlated highest with the sum of the eight items,
and had one of the highest reliability coefficients; yet ratings on
this item did not correlate with maze scores. The fact that no re-
lationship, as opposed to a negative relationship, was found be-
tween maze scores and teacher ratings of stimulus-seeking, coupled
with other evidence of the validity of the Howard Maze Test, sug-
gests that further work should continue in an attempt to establish
the concurrent validity of the test. What specific stimulus-seeking
behavior the Maze Test is measuring should be established.

A great deal of research today concerns the evaluation of the
experiences given to young disadvantaged children. Suitable meas-
ures for evaluating the effects of such experiences, including ex-
periences encouraging stimulus-seeking behavior, are few, especially
when it is desirable to compare disadvantaged children's perform-
ance with that of advantaged children. There is almost always an
intellectual difference between two such groups, yet performance
on many tests is dependent upon at least average intellectual
ability. Apparently performance on the Howard Maze Test is in-
dependent of intellectual ability. Thus, it is a good candidate for
evaluating the effects of experiences designed to foster stimulus-
seeking. There is a need, however, to determine the nature of the
everyday behavior being measured.
THE CRITERION-RELATED VALIDITY OF ENGLISH
LANGUAGE SCREENING INSTRUMENTS FOR FOREIGN
STUDENTS ENTERING THE UNIVERSITY OF
SOUTHERN CALIFORNIA


JACK D. BURKE 
University of Houston 


WILLIAM B. MICHAEL, ROBERT B. KAPLAN, AND ROBERT A. JONES
University of Southern California 


For a total group of 178 foreign students enrolled at the Uni-
versity of Southern California (USC), as well as for four specific
subgroups of varying levels of evaluated proficiency in English
skills, this correlational investigation was undertaken to determine
the criterion-related validity of a battery of six achievement and
[...] examinations that were administered at the start of each
semester over a three-year period to screen foreign students for
English competency. Representing in part a replication of two previ-
ous studies by Jones and Michael (1961) and Jones, Kaplan, and
Michael (1964), the three major purposes of this investigation were:
(1) to ascertain with respect to two criterion variables—(a) grade
point average (GPA) and (b) academic standing on a 4-point
scale (continuous satisfactory achievement, probationary status with
[...] clearance, probationary status never altered, and dis-
[...])—the predictive validity of each of the six cognitive screen-
ing tests requiring skills in receptive and expressive language func-
tions (zero order coefficients) and of a composite score obtained
from the six predictors (multiple correlations), (2) to report
intercorrelations among all pairings of predictor and criterion vari-
ables, and (3) to obtain the validities of the same predictor vari-
ables and of their composites with three additional evaluative
[...] employed in the English Communication Program for
Foreign Students (ECPFS)—(a) final grade in ECPFS at the end
of a semester's work based on the composite judgment of as many
as eight instructors, (b) the Cooperative School and College Abil-
ity Test (SCAT) Verbal score, and (c) the SCAT Quantitative
score.

Method. The test and criterion variables are cited in Table 1. In- 
formation concerning the three USC devised measures—English 
Theme, Speech Interview, and the Larry Ward English Examina- 
tion for Foreign Students—may be found in the two articles previ- 
ously cited. 

The subjects in the total sample were predominantly single, 
self-supporting males in their mid-twenties and native speakers 
of non-Indo European languages. However, 6.2 per cent of the 
students were in the 18-19 age category, and 22.4 per cent were 
thirty years of age or more. Of the 178 students, 39 were females. 
Nearly 60 per cent were graduate students majoring in the social 
sciences or engineering. From this total sample, four subgroups 
were formed: (1) 85 students enrolled in selected courses in a pro-
gram in English, (2) 46 in a full program of English Communica-
tion courses, (3) 29 judged to be sufficiently fluent to enroll in a full 
academic program of regular coursework, and (4) 16 who obtained 
waivers from their respective academic departments allowing them 
to be excused from taking ECPFS courses. Two students were ex- 
cluded from membership in any one of these four subgroups, as they 
represented unique cases. 

Intercorrelations among all predictor and criterion variables, co- 
efficients of multiple correlation between a composite of the six 
predictor variables and each of the criterion measures, and beta 
weights were determined from an IBM computer program at the 
USC Graduate School of Business Administration for the total 
sample as well as for each of the four subgroups. 

Findings. The statistical findings for the total sample are re- 
ported in Table 1. In view of the absence of any statistically sig- 
nificant coefficients of multiple correlation between the composite of 
six predictor variables in any one of the four subgroups or between 
the composite of the six predictors and ECPFS final grade in the 
two appropriate subgroups, it was decided not to report the cor- 
relational data for any of the subsamples. 

Conclusions. The principal conclusions, which were in general 
agreement with those found by Jones, Kaplan, and Michael (1964), 
were as follows: (1) GPA could be predicted more accurately than 
could the measure of academic standing; (2) the Speech Interview,
the Larry Ward English Examination for Foreign Students, and 
the California Reading Test were the three most valid scales for 
the prediction of GPA among the six cognitive tests employed; (3) 
the intercorrelations among the predictor variables were moderate 
to high and thus suggested the presence of a substantial degree of 
overlap among them; (4) the interrelationships among the criterion 
variables were relatively low and considerably less than those among 
the predictors; (5) there was little relationship between GPA and 
any one of the three ECPFS departmental criteria (variables 9, 10, 
and 11); (6) the relationship between final grades after one semester 
of work in ECPFS (composite judgment of instructors) and each 
of the predictors and a weighted combination of these predictors 
was substantial; and (7) a composite of the predictor variables, as 
expected, showed a higher degree of validity with each of the cri- 
terion measures than did a single predictor, although with the ex- 
ception of the ECPFS criterion variable for the total sample the 
relationships were low. 

Recommendations. It was recommended that despite limitations 
posed by the restriction in range of talent and the unreliability of 
GPA as a criterion measure (1) the California Reading Test, 
Speech Interview, and the Larry Ward English Examination for 
Foreign Students be retained as indicators of English proficiency 
of foreign students entering USC, and (2) additional experimen- 
tation with other scales be continued to improve the predictive ef- 
fectiveness of screening instruments for the placement of foreign 
students in appropriate programs. 
THE PREDICTIVE VALIDITY OF GRE SCORES FOR A
Among those who work in graduate education there is almost
unanimous agreement as to the desirability of identifying promising
graduate students prior to admission to doctoral programs. This
study was undertaken to determine the relevance of the Graduate 
Record Examination as an admission standard for doctoral study at 
Colorado State College. The investigators elected to approach the 
problem by evaluating GRE scores as predictors of success in grad- 
uate school, but they recognize that there may be other relevant 
bases for using these scores as admission standards. The study is 
believed to be unique in at least two respects: (1) It is the first 
such study at Colorado State College, an institution with an un- 
usually large doctoral program in education, and (2) the investi- 
gators sought to develop new and useful criterion variables not 
used in previous research of this sort. 
Sample. The study was restricted to doctoral students who either
graduated or were dismissed from the program during a recent three
year period. A total of 231 persons were identified who successfully
completed the doctoral program and for whom all variables of in-
terest were available. There were also 21 subjects who were form-
ally admitted to the doctoral program, completed a minimum of
30 quarter hours of course work at the doctoral level, and then were
formally dismissed from the program. There was a very large group
of students who were discouraged from pursuing the doctoral de-
gree at an earlier point in their programs, and another very large
group who were unable to complete the degree for reasons not known
to the investigators; these two groups were not included in the
research.

Predictor and criterion variables. The predictor variables in- 
cluded the two GRE aptitude scores (verbal and quantitative), the 
three GRE area scores (social science, humanities, and natural sci- 
ence), plus the GRE advanced education score. Four criterion vari- 
ables were used: (1) grade point average in doctoral studies, (2) 
graduation versus dismissal from the program, (3) normative judg- 
ment analysis, and (4) ipsative judgment analysis. In the norma- 
tive JAN approach, profiles of test scores and other predictor vari- 
ables on 30 representative students were presented to each of 16 
graduate professors who served as judges. Each judge was asked to 
rate each profile (the students were not identified by name) on a 5-
point scale with respect to the student's prospects as a doctoral
student. In the ipsative JAN approach, the same 16 professors were 
presented with the names of the students who graduated, asked to 
identify ten whom they knew, and then to rank them on a one-to- 
ten scale on the basis of professional promise. The judges did not 
have access to the test scores and other profile data on these stu- 
dents, and it was the intent of the investigators that the ratings 
would be loaded with personality factors not readily accessible in 
other criteria. 

Findings. The findings are reported in Table 1, which includes 
both the zero order product moment correlation of each predictor 
with each criterion and the multiple correlation coefficient obtained 
when the six predictors were combined in a single prediction equa- 
tion. The zero order correlation coefficients that were significant at 
the 0.01 level are marked with an asterisk. In every case, the mul- 
tiple correlation coefficients were statistically significant. 

With all except the normative JAN criterion, the correlations 
were quite low—low enough to raise serious doubts about the pre- 
dictive validity of GRE scores for this particular doctoral program. 
The findings with respect to normative JAN suggest considerable 
agreement among the graduate faculty with respect to admissions 
policy, although its usefulness in identifying potentially successful 
graduate students has not been demonstrated. The investigators 
suspect the correlations would have been somewhat higher were it
not for the fact that the GRE aptitude scores were used for screen-
ing the subjects prior to admission. GRE scores are being retained
as admissions variables at Colorado State College, and a cross val-
idation study is planned.

TABLE 1
Correlations of Six GRE Scores with Four Criterion Variables

                                   FOUR CRITERION VARIABLES

SIX                            Grade      Whether      Normative    Ipsative
PREDICTOR                      Point      or not       Judgment     Judgment
VARIABLES                      Average    Graduated    Analysis     Analysis

GRE Verbal Aptitude            0.32*      0.21*        0.38*        0.26*
GRE Quantitative Aptitude      0.21*      0.28*        0.27*        0.17*
GRE Social Science Area        0.14       0.20*        0.53*        0.23*
GRE Humanities Area            0.24*      0.24*        0.16         0.16
GRE Natural Science Area       0.15       0.19*        0.49*        0.14
GRE Advanced Education         0.28*      0.26*        0.17*        0.30*
Six predictor Multiple R       0.39       0.34         0.60         0.32

* Significant at or beyond the .01 level.
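
One way to read Table 1 is to compare each multiple R with the best single predictor for the same criterion. A least-squares multiple correlation cannot be smaller than the largest absolute zero-order correlation among its predictors, and the difference in squared values shows how little the remaining five scores add. The sketch below only re-uses the printed table entries; the dictionary layout and the `summarize` helper are illustrative, not the investigators' procedure.

```python
# Zero-order correlations and multiple R values as printed in Table 1.
# Predictor order: verbal, quantitative, social science, humanities,
# natural science, advanced education.
table1 = {
    "grade point average": ([0.32, 0.21, 0.14, 0.24, 0.15, 0.28], 0.39),
    "graduated or not": ([0.21, 0.28, 0.20, 0.24, 0.19, 0.26], 0.34),
    "normative JAN": ([0.38, 0.27, 0.53, 0.16, 0.49, 0.17], 0.60),
    "ipsative JAN": ([0.26, 0.17, 0.23, 0.16, 0.14, 0.30], 0.32),
}


def summarize(zero_order, multiple_r):
    """Compare a multiple R with the strongest single predictor."""
    best_single = max(abs(r) for r in zero_order)
    return {
        "best_single_r": best_single,
        # A least-squares multiple R can never fall below the best single r.
        "consistent": multiple_r >= best_single,
        "variance_explained": multiple_r ** 2,
        "gain_over_best_single": multiple_r ** 2 - best_single ** 2,
    }


summaries = {name: summarize(rs, big_r) for name, (rs, big_r) in table1.items()}

for name, s in summaries.items():
    print(
        f"{name}: best r = {s['best_single_r']:.2f}, "
        f"R^2 = {s['variance_explained']:.3f}, "
        f"gain = {s['gain_over_best_single']:.3f}"
    )
```

Read this way, the six scores together account for roughly 15 per cent of the variance in grade point average and about 10 per cent for the ipsative ratings, which is in line with the authors' reservations about predictive validity.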
THE RELATIONSHIP OF PERFORMANCE ON
OBJECTIVE ACHIEVEMENT EXAMINATIONS TO THE 
ORDER IN WHICH STUDENTS COMPLETE THEM 


JOAN J. MICHAEL 


California State College, Long Beach 
AND 


from a study of ten college and university classes within three dif-
ferent departments of educational psychology whether there was
a significant relationship between level of performance of students
on objective achievement examinations with generous time limits and
the order in which they finished their tests. In a preliminary in-
vestigation in 1964 involving use of an open-book objective final
examination in an introductory measurement and evaluation
course, the second author found a significant curvilinear relation-
ship between the order in which 90 students turned in their papers
and their scores. Relative to whether they were among the first,
second, or last third of the examinees to turn in their answer
sheets, the numbers who ranked in the top, middle, or bottom
thirds of performance were, respectively, 8, 12, and 10; 16, 10,
and 4; and 6, 8, and 16. Thus, it appeared that those students who
took a moderate amount of time were most likely to earn the highest
scores. The implication of this finding was that the validity of
scores on an essentially untimed achievement examination might
be related to how long students spend in taking a test and in re-
viewing their responses. In light of this finding, it was thought ap-
propriate to replicate the study in nine additional classes, in certain
ones of which the final examinations were open-book and in others,
closed-book.

Method. As indicated in Table 1, both open- and closed-book
tests were administered to undergraduate and graduate classes.
These examinations consisted of objective items ranging from 25 to
140 in number. In most instances, items had been revised and edited
in light of findings from previous item analyses. An effort was made
to incorporate items that would tap higher level processes such as
comprehension, interpretation, and problem-solving rather than
the recall of factual information. Students were permitted to turn
in their papers whenever they had finished and after they had had
ample opportunity to go over their answers. A sequence number
was assigned to each answer sheet to indicate the order in which
the test was handed to the examiner. There were no scoring penalties
for so-called guessing.
The statistical analysis involved (1) the drawing of scatterplots
showing the regression of performance upon order of completion
and (2) the formation of three-by-three contingency tables to de-
termine through use of the chi square statistic the probability of a
significant relationship.

TABLE 1
Chi Square Values and Associated Probability Levels for 3 X 3 Contingency Tables Showing
Relationship between Placing in Top, Middle, or Bottom Third in Examination Performance
and Falling in First, Second, or Third of Those Completing the Examination

Class and Level:                      Number    Open- or             Chi
Upper Division (UD)                   of        Closed-Book          Square
or Graduate (G)                       Items     Test           N     (df = 4)   Probability

 1. Introductory Measurement (UD)     105       Open           90    13.6       .001 < p < .01
 2. Introductory Statistics (UD)       80       Open           27     7.5       .10 < p < .20
 3. Introductory Statistics (UD)       80       Open           26     7.9       .05 < p < .10
 4. Elementary Statistics (G)          99       Closed         25     1.5       .80 < p < .90
 5. Intermediate Statistics (G)        95       Open           40     7.8       .05 < p < .10
 6. Psychological Foundations (UD)     25       Closed         25     5.2       .20 < p < .30
 7. Educational Psychology (G)        140       Open           90     7.6       .10 < p < .20
 8. Educational Psychology (G)         80       Closed         22     5.8       .20 < p < .30
 9. Educational Psychology (G)        140       Open           25     4.2       .30 < p < .50
10. Educational Psychology (G)        140       Closed         24     1.5       .80 < p < .90
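
As an illustration of the contingency-table step, the counts reported above for the 1964 preliminary class can be run through the usual Pearson chi square computation. This is a minimal sketch and not the authors' own procedure: the function names are illustrative, and the tail probability uses the closed-form expression that holds only for 4 degrees of freedom, the case of a three-by-three table.

```python
import math

# Counts reported for the 1964 class of 90 students.
# Rows: first, second, and last third of examinees to turn in papers.
# Columns: top, middle, and bottom third of performance.
observed = [
    [8, 12, 10],
    [16, 10, 4],
    [6, 8, 16],
]


def chi_square(table):
    """Pearson chi square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (obs - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def chi_square_tail_df4(x):
    """Upper-tail probability of chi square with exactly 4 degrees of freedom."""
    return math.exp(-x / 2) * (1 + x / 2)


stat, df = chi_square(observed)
p = chi_square_tail_df4(stat)
print(f"chi square = {stat:.1f}, df = {df}, p = {p:.4f}")
```

With every expected frequency equal to 10, the statistic comes to 13.6, the value given for the first class in Table 1, and the tail probability falls between .001 and .01.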
Findings. The results of the statistical analysis are shown in Ta-
ble 1. Replication of the initial 1964 study, the results of which
are shown in the first row, reveals that for the other nine groups
there were no values of chi square significant at or beyond the
.05 level. Inspection of the scatterplots and contingency tables
showed no consistent patterns or relationships. In the larger sam- 
ples, sex differences were negligible. Thus, with the one exception 
of the first sample, there was a lack of significant relationship be- 
tween the level of performance and order of completion of examina- 
tions. Although the results of this investigation were negative, it 
seems worthwhile to report them so that the occasional appearance 
of a significant result in an isolated study will not generate a Type I 
error in the professional literature. 
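
The concern about an isolated significant result can be made concrete. If several tests are each run at the .05 level and no true relationship exists, the chance that at least one reaches significance grows quickly with the number of tests. The sketch below assumes the tests are independent, which is an assumption made here for illustration and not a claim from the study.

```python
def familywise_error(alpha, k):
    """Chance of at least one false positive in k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k


# Ten classes were examined in all; smaller counts are shown for comparison.
for k in (1, 5, 10):
    print(f"{k:2d} tests at .05: P(at least one significant) = "
          f"{familywise_error(0.05, k):.3f}")
```

Under that assumption, ten tests at the .05 level carry about a 40 per cent chance of producing at least one nominally significant result, so a single significant class among ten is weak evidence of a real relationship. The first class was in fact significant beyond the .01 level, for which the corresponding figure is about 10 per cent.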

Discussion. The absence of a demonstrable relationship between 
test performance and time of completion may have one positive 
outcome: students who see others turn in their examination papers 
early need not panic. They may take some consolation in the evi- 
dence that the order in which they turn in their papers will prob- 
ably not be related to how well they do. However, it may be that 
separation of individuals into levels of standing on certain measur- 
able constructs such as general intelligence, need achievement, or 
test anxiety might reveal differential levels of test performance rel- 
ative to each of several designated temporal intervals during which 
students turn in their answer sheets. Use of improved experimental 
designs incorporating several independent and/or classificatory vari-
ables might show important interactions not apparent in the data of
the ten samples reported in this paper. 
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A SITUATION DESCRIPTION QUESTIONNAIRE FOR 
LEADERS 


GARY YUKL 
Sacramento Btate College 


In. recent years, social scientists have become aware of the im- 
portant role played by situational variables, both as an influence on 
poup behavior and as a modifier of the relation between leader 
behavior and group performance. However, discussion of the im- 
Pertance of situational variables has not been accompanied by a 
tebterted effort to develop comprehensive models and measures of 


In an attempt to identify and measure situational variables other
than the usual structural variables discussed in the management
literature (e.g., Porter and Lawler, 1965), a leader situation
description questionnaire was developed and validated. This 52-item
questionnaire consisted of questions that appeared to be relevant to
the behavior of formal leaders (e.g., task attributes, seriousness of
errors for the organization, upward and downward ...). The selection
of items was influenced by an earlier analysis of group dimensions by
Hemphill (1956), a factor analysis of task dimensions by Shaw (1963),
and leadership research carried out by Fiedler (1968).


The new leader situation description questionnaire (LSDQ) was
administered to a sample of 101 first-line supervisors ("foremen") and
second-line supervisors ("managers") in two ... companies and a public
utility. Twenty-one elected and appointed student leaders in a large
university were also surveyed. The supervisors' responses were
dimension-analyzed by means of the BCTRY programs for factor and
cluster analysis (Tryon and Bailey, 1966). Scale scores were computed
using the items with high factor loadings.

TABLE 1 
Correlations among the Situational Variables 

                                 1       2       3       4       5
1. Task Difficulty 
2. Task Structure              -.39**
3. Cooperation Requirements    -.03     .18
4. Production Pressure          .82**  -.03     .26*
5. Leader Power                 .35**  -.06     .06     .07
6. Error Cost                   .98**  -.07     .25*    .93**   .30**

N = 101. 
*P < .05. 
**P < .01. 


The dimension analysis yielded six oblique clusters which ac- 
counted for 91 per cent of the item communality.1 These clusters 
appeared to correspond to meaningful situational dimensions and 
were labeled and defined as follows: 


1. Task Difficulty: The number of highly developed skills required 
to perform the subordinates’ tasks, the susceptibility of these tasks 
to error, and the general perceived difficulty of the tasks. 

2. Task Structure: The number of ways in which the task can be 
performed and the degree of rapid performance feedback available 
to the leader. Highly structured tasks have little procedural
variability and considerable performance feedback.

3. Cooperation Requirements: The amount of subordinate role in- 
terdependence and the degree to which the subordinates depend 
upon each other for successful completion of their tasks. 

4. Production Pressure: The frequency and intensity of requests 
for faster decision-making or better group performance made by 
persons outside of the leader’s group. 

5. Leader Power: The capacity of the leader for rewarding sub- 
ordinates, and the extent to which the leader is authorized to give 
orders and enforce their implementation. 

6. Error Cost: The seriousness of performance errors for the or- 
ganization. 


The correlations among the six situational dimensions are presented
in Table 1. It is interesting to note that the empirical cluster
analysis of “real-life” industrial tasks yielded task dimensions
similar to those found by Shaw (1962) in his research using judges'
ratings of "laboratory tasks."

1 The situation questionnaire, scoring instructions, and a table of
loadings for the items are available from the author by request.

TABLE 2 
LSDQ Means for the Three Samples of Leaders 

LSDQ Scale                 Managers   Foremen   Students
Task Difficulty               .36       -.63      -.39
Task Structure               -.06        .26      -.13
Cooperation Requirement       .10       -.07      -.21
Production Pressure           .21       -.18      -.53
Leader Power                  .22       -.23      -.43
Error Cost                    .20       -.05      -.02

Note. The sample included 72 managers, 29 foremen, and 21 students.
Item scores were standardized before scale scores were computed.

The means of the LSDQ scales for each of the three samples of 
leaders are presented in Table 2. In order to determine whether the 
LSDQ scales would discriminate between leaders known to differ 
with respect to the situational variables, the LSDQ scale means for 
three types of leaders were compared. The t-tests of differences 
between pairs of corresponding means are shown in Table 3. The 
differences were all in the direction one would expect, and the ma- 
jority of the differences were significant. 

Discriminant validation (Campbell and Fiske, 1959) was carried 
out by correlating the LSDQ scales with measures of leader char- 
acteristics which ideally will not influence the leader’s perception 
of the situational variables. Several personality and attitude scales 
were distributed to the supervisors along with the LSDQ. These
included an LPC scale measuring the leader's evaluation of his
least preferred coworker and the leader’s task-orientation (Fiedler,
1968; Yukl, 1969), a tailor-made F-scale measuring leader
Authoritarianism (Adorno, Frenkel-Brunswik, Levinson, and Sanford,
1950), the FIRO-B scales measuring six interpersonal needs of the
leaders (Schutz, 1958), and a short questionnaire requesting
information about leader age and formal education.

TABLE 3 
T-tests of Differences between LSDQ Means for the Three Samples of Leaders 

                            Foremen vs.   Foremen vs.   Managers vs.
LSDQ Scale                  Managers      Students      Students
Task Difficulty               ...           ...          6.05**
Task Structure                ...           ...           ...
Cooperation Requirements      ...           ...          2.43*
Production Pressure          2.15*         2.38*         5.87**
Leader Power                 3.13**         .79          3.30**
Error Cost                   1.52          2.86**        5.43**

*P < .05 for a two-tail test. 
**P < .01 for a two-tail test. 

The preceding analysis revealed that Authoritarianism correlated
with Task Difficulty (r = .24, P < .05) and with Leader
Power (r = .33, P < .01). When the considerable difference in
average age for the two groups of supervisors was controlled, age
was correlated negatively with Production Pressure (r = -.31,
P < .01). However, these were the only significant correlations,
and they represent less than five per cent of the correlations be- 
tween the six LSDQ scales and the eleven leader variables. 

Summary. The LSDQ was found to measure six meaningful 
dimensions of leadership situations, and these dimensions appeared 
to have reasonable construct validity. The LSDQ and other meas- 
ures of these situational variables can serve as important research 
tools in the study of leadership and related topics in industrial
psychology.

THE UTILITY OF IMPORTANCE WEIGHTS IN 
PREDICTING OVERALL JOB SATISFACTION AND 
DISSATISFACTION 


L. K. WATERS 
Ohio University 


Measures of overall satisfaction are typically obtained by summing
responses to individual items concerning satisfaction with
particular aspects of the work situation. As Glennon, Owens, Smith, 
and Albright (1960) have pointed out, this procedure ignores the 
importance of each item to the respondent. If importance is
a meaningful dimension, then the response to each item should
be weighted by the importance of the item to the employee.
While importance weighting is intuitively appealing, it must be
shown that the use of importance weights adds to the prediction of
separately measured overall satisfaction. This implies that the
importance weighted sum of item responses and unweighted sum
should not be highly correlated. In studies by Decker (1955), Ewen
(1967), and Schaffer (1953) importance weighting did not add
significantly to the prediction of overall satisfaction. Also, Mikes and
Hulin (1968) found that an unweighted sum of job component
scores correlated higher with a termination criterion than an
importance weighted sum. In view of the negative evidence, the
purpose of the present study was to evaluate further the utility
of importance weighting in the prediction of overall job attitudes.

Method. Subjects. The subjects in this study were 126 nonsupervisory
female employees in one regional office of a national insurance
company on whom complete data were available. The women
ranged in age from the late teens to early sixties and all were at 
least high school graduates. 

Scales. The job attitude scales were administered to small groups 
of employees by the author during a single working day. The employees
were assured that their individual responses would not be
made known to the company. 

The scales were presented in booklet form. On the first three 
pages were scales concerned with degree of overall satisfaction with 
the job, degree of overall dissatisfaction, and degree of overall satis- 
faction/dissatisfaction. The overall satisfaction/dissatisfaction rat- 
ings were made on a 12-point scale ranging from extremely satisfied 
to extremely dissatisfied. The separate satisfaction and dissatis- 
faction ratings were made on a 7-point scale which consisted of 
the appropriate six points of the 12-point satisfaction/dissatisfaction
scale plus a seventh alternative (not satisfied or not
dissatisfied). The next scales presented were the five Job Description
Index (JDI) scales. These were not used in this study but were 
involved in a larger study concerning the two-factor theory of job 
satisfaction (Waters and Waters, in press). Following the JDI, 
the same list of eleven job factors was presented on three successive 
pages. Over the three pages, each job factor was: (a) rated on 
present satisfaction/dissatisfaction (using the same 12-point scale 
mentioned above), (b) rated for importance in contributing to 
feelings of satisfaction, and (c) rated for importance in contributing 
to feelings of dissatisfaction. Both importance ratings were made on 
a 6-point scale ranging from “of extreme importance” to “of no 
importance.” The eleven job factors were: competent supervision, 
considerate supervision, company policies and practices, co-workers, 
opportunity for growth and advancement, physical working condi- 
tions, responsibility on the job, recognition for work done, salary, 
sense of achievement, and the work itself. 

Variables. The following variables were used in this study: 


1. Sum of responses (SD) to the eleven job factors on the 12-point
satisfaction/dissatisfaction scale (Σ SD).

2. Sum of satisfaction/dissatisfaction scale responses (SD) for
each factor multiplied by the importance (I_S) of the factor in
contributing to feelings of satisfaction (Σ SD*I_S).

3. Same as above except that importance (I_D) in contributing to
feelings of dissatisfaction was used in the weighting (Σ SD*I_D).

4. Overall satisfaction rating.

5. Overall dissatisfaction rating.
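
The three composite variables defined above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the tiny data set, and the numeric coding of the importance ratings (a larger number standing for greater importance) are hypothetical and are not taken from the study.

```python
def composite_scores(sd, imp_s, imp_d):
    """Return (unweighted, I_S-weighted, I_D-weighted) sums per respondent.

    sd    : satisfaction/dissatisfaction ratings, one list per respondent
    imp_s : importance for satisfaction, same shape as sd
    imp_d : importance for dissatisfaction, same shape as sd
    """
    unweighted = [sum(row) for row in sd]
    weighted_s = [sum(x * w for x, w in zip(row, ws))
                  for row, ws in zip(sd, imp_s)]
    weighted_d = [sum(x * w for x, w in zip(row, wd))
                  for row, wd in zip(sd, imp_d)]
    return unweighted, weighted_s, weighted_d


# Hypothetical ratings: two respondents, three job factors (the study used eleven).
sd = [[10, 8, 12], [3, 5, 4]]
imp_s = [[6, 5, 4], [2, 3, 1]]
imp_d = [[1, 2, 3], [6, 6, 6]]

unweighted, weighted_s, weighted_d = composite_scores(sd, imp_s, imp_d)
```

The weighting is multiplicative, item by item, which is the only model the study examined.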


Results and discussion. Table 1 gives the intercorrelations among
the two weighted sums and the unweighted sum, and their correlations
with the overall job attitude measures. It is obvious that
the unweighted and I_S weighted sums were so highly correlated
that very little difference could have been expected in
their correlations with overall job attitude measures. The magnitude
of this correlation was consistent with those reported by
Ewen (1967). If a relationship of this magnitude between a weighted
and unweighted sum of the same variable is consistently
found, any claims for the superiority of importance weighted job
satisfaction composites seem quite unfounded. The correlation between
the I_D weighted total and the unweighted total was not so
large as to preclude differences in the prediction of overall job
attitudes.

TABLE 1 
Correlations among Weighted and Unweighted Job 
Component Sums and Measures of Overall Job Attitudes 

                               1        2        3        4       5
1 Σ SD                                 .93      .58      .63    -.54
2 Σ(SD*I_S)                                     .65      .58    -.48
3 Σ(SD*I_D)                                              .43    -.34
4 Overall Satisfaction                                           .65
5 Overall Dissatisfaction
Mean                         94.43   500.22   424.38    3.55    1.68
S.D.                         19.24   125.15   167.04    1.30    1.31

If importance is an effective weighting method, then the I_D
weighted sum should correlate higher than the unweighted sum
with overall dissatisfaction and the I_S weighted sum should correlate
higher than the unweighted sum with overall satisfaction. As
could have been predicted from the very high intercorrelation,
I_S weighted and unweighted totals correlated very similarly. In
the case of the unweighted and I_D weighted sums where a difference
was possible, the correlation of the unweighted sum with overall
dissatisfaction was larger than the correlation of the I_D weighted
sum with the same overall dissatisfaction measure. The difference
between the two correlations was tested for significance (Walker
and Lev, 1953, p. 257) and yielded a t value of 2.92 (P < .01).
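
The test of the difference between two correlations that share a variable can be sketched as follows. The text cites only Walker and Lev, so the use of Hotelling's t for dependent correlations here is an assumption, and the function name is hypothetical. The inputs are the rounded values from Table 1 with N = 126.

```python
import math


def dependent_corr_t(r12, r13, r23, n):
    """Hotelling's t for r12 vs. r13 when both involve variable 1.

    r23 is the correlation between the two competing predictors.
    The statistic is referred to a t distribution with n - 3 df.
    """
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    return (r12 - r13) * math.sqrt((n - 3) * (1 + r23) / (2 * det))


# Variable 1 = overall dissatisfaction; 2 = unweighted sum; 3 = I_D weighted sum.
t = dependent_corr_t(r12=-0.54, r13=-0.34, r23=0.58, n=126)
print(round(abs(t), 2))
```

With the two-decimal correlations from the table, the result is about 2.9 in absolute value, close to the reported 2.92. The small discrepancy would be expected from rounding if this is indeed the formula that was used.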

Considered in conjunction with the studies by Decker (1955), 
Ewen (1967), Schaffer (1953), and Mikes and Hulin (1968), the 
results of this study cast serious doubt on the utility of importance
weighting of job satisfaction components. It can be argued that
either the satisfaction with job related components or the importance
of the components, or both, were not reliably measured in
any given study, but the consistency across these studies suggests 
that this is not a very reasonable explanation of the negative re- 
sults. It should be noted that the only model investigated has been 
a multiplicative one. It may be that other procedures for
incorporating importance would yield more promising results.

Anne Anastasi. Psychological Testing. (3rd ed.) New York:
Macmillan, 1969. Pp. v + 665. $9.25.


The standard of excellence of the first two editions of this text has 
been met and excelled in the third edition. The changes that have 
been made extend from the enlargement of page size to the
increased use of visuals, the deletion of outmoded terms, tests and
language, and the expansion and clarification of difficult constructs.

The author is skilled in putting intricate ideas into simple, clear
language. She can manage to define a complex idea in one sentence
and move on to another in the next. The author's style makes for
very efficient information gathering; however, the rapidity with
which ideas are presented and advanced suggests that this text
be used with intermediate undergraduate or graduate level
students. Although terms are fully defined, it would appear that a
course in statistics should be a prerequisite before this book is used
as a text.

References are current and appropriately included in the text. 
Where the text has not been expanded, current references have 
been referred to in footnotes. In addition, sources for expanded 
discussions are frequently cited. 

The section on Validity has been expanded and in the reviewer's
opinion is an improved version over what was an excellent
presentation in the second edition. This is typical of the changes in
the 3rd edition, that is, sections that were acceptable have been 
fully updated including ideas from recent theory and research. 

Additions to the text include discussion of Moderator Variables,
a section expanded into a chapter on Item Analysis, a greatly
expanded section on Factor Analyses, a section on Convergent
Discrimination Analyses, the inclusion of the Campbell and Fiske
Multitrait-Multimethod matrix, a clarification of the terminology
in Decision Theory, and a revised section on Clinical Judgement
which includes an excellent summary of the issues raised by
Meehl’s book. Occasionally one wishes for a more critical
evaluation of some of the newer theories presented.

The tests discussed are selected with care and are current. The
author uses discussion of the tests to highlight test theory and
related problems, some of which were presented in earlier chapters.
Thus the later chapters not only present new ideas but build on
what was discussed earlier. The author evaluates each test
presented in the book fairly and critically and has eliminated outdated
tests. A useful Classified List of Representative Tests is included
in the appendix. The presentation of the discussion of the Special 
Aptitude tests has been reduced from two chapters to one by 
eliminating the section on Literary Appreciation and expanding 
the section on Creativity. 

Section Five on the methodological problems of Projective 
Techniques and Other Techniques for Personality Assessment 
contains one of the outstanding discussions of “personality” testing 
to date without the biting edge of some past discussions. A new 
section on Psychometric Instruments versus Clinical Tools expands 
and updates current trends. 

The final section, Social Implications of Testing Today, sum- 
marizes the recent attacks on testing, reviews the ethical use of 
tests and includes, in Appendix A, the Ethical Standards of Psy- 
chologists. 

Throughout this revision an emphasis is placed on the problems 
of testing and the theory and usefulness of a test rather than its 
history. In fact the only negative criticism one would make of the 
book is that the history section does not deal with the emergence 
of modern problems. 

The text, while an excellent summary of educational psycho- 
logical measurement problems, places its emphasis on the psycho- 
logical aspects of educational measurement. This is not a text for a 
beginning survey course in educational measurement and is not 
designed for teacher education. This should in no way detract 
from its usefulness. 

While this is clearly not a cookbook, if it were, it would be 
designed for the gourmet. 


NICHOLAS J. ANASTASIOW 
Indiana University 


R. Darrell Bock and Lyle V. Jones. The Measurement and Prediction
of Judgment and Choice. San Francisco: Holden-Day,
1968. 370 pp. $13.75.


In the Preface, the authors note that this book is based on more
than a decade of research in the area of preference measurement
methodology. Noting that the Thurstone type scaling methods have
often been used with only moderate size samples of judges, they
have provided significance tests and confidence intervals, so that
descriptive statistics from experimental studies can be correctly
interpreted. The book gives a detailed and statistically sophisticated
treatment of a few topics. The authors note that in addition
to “detailed specification of the formal scaling models for the
methods of constant stimuli, paired comparisons, rank order, and
successive categories, we have attempted to provide corresponding
methods for statistical inference.”

It is assumed that the reader has a reasonable knowledge of 
mathematical statistics, calculus, probability and matrix theory. 
For the reader who wishes to obtain an adequate background, 
references to appropriate theoretical treatments are given. The 
reader who is interested in a manual of computing procedures 
for scaling will find that for most of the methods presented, the 
general formulas and the accompanying numerical illustrations give 
a clear account of the appropriate computations. In the first chapter
there is a statement of the topics in scaling not covered in the book, 
and references are cited where such coverage is given. 

Considerable attention is given to the problem of reducing bias. 
For observed proportions of zero or unity, it is recommended that 
1/2N and 1-1/2N be used respectively. Several other suggested 
correction procedures have been tried out and found not to be as 
good as the “1/2N rule.” For the logistic, the bias in using the
usual formula y = ln(p/q) has been studied and compared with
results from Anscombe's correction y = ln[(p + 1/2N)/(q + 1/2N)].
The latter formula is recommended because it removes bias for
population proportions that are not too near zero or unity.
Correspondingly, for the angular transform, y = 2 sin^-1(√p) - (π/2),
several different possibilities are considered for p. It is concluded
that the best one is p = (r + 1/4)/(N + 1/2), instead of the
usual p = r/N (where N is sample size, say, the total number of
zeroes and ones, and r is the number of ones).
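
The corrections described in this paragraph can be written out as a short sketch. The function names are hypothetical, and the code simply restates the formulas as quoted in this review; it is not taken from the book's own computing procedures.

```python
import math


def half_n_rule(r, n):
    """Observed proportion r/n, with 0 and 1 replaced by 1/2N and 1 - 1/2N."""
    if r == 0:
        return 1 / (2 * n)
    if r == n:
        return 1 - 1 / (2 * n)
    return r / n


def logit_corrected(r, n):
    """Logistic deviate with 1/2N added to both p and q."""
    p = r / n
    q = 1 - p
    c = 1 / (2 * n)
    return math.log((p + c) / (q + c))


def angular_deviate(r, n):
    """Angular transform using p = (r + 1/4) / (N + 1/2)."""
    p = (r + 0.25) / (n + 0.5)
    return 2 * math.asin(math.sqrt(p)) - math.pi / 2


# Hypothetical sample of N = 20 binary judgments.
n = 20
lower_bound = half_n_rule(0, n)       # 1/2N replaces an observed zero
finite_logit = logit_corrected(0, n)  # finite even though r = 0
mid_logit = logit_corrected(10, n)    # p = q, so the deviate is zero
mid_angle = angular_deviate(10, n)    # p = .5, so the deviate is zero
```

One property worth noting is that the corrected logit stays finite at r = 0 or r = N, whereas the usual ln(p/q) is undefined there.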

A thorough presentation is given of the constant method and of
how it may be used to determine the point of subjective equality,
the differential threshold, and the absolute threshold. It should be
noted that a relevant physical measure of stimulus intensity is
essential for the constant method. For the constant method,
regression analyses and confidence bounds are given and illustrated
for the weighted minimum normit and for maximum likelihood.
Two computationally simpler methods are also presented, a graphical
procedure and the unweighted minimum normit, with the
indication that neither of these methods "provide statistical tests or
interval estimates on the parameters of the psychophysical model."
Each procedure presented is characterized as to whether or not it
is unbiased, efficient, sufficient, or consistent, etc.

The method of paired comparisons and the law of comparative
judgment are also presented, together with a number of balanced
incomplete block and partially balanced incomplete block designs
to reduce the number of judgments made by the subjects.

The method of successive categories, or successive intervals,
with the associated law of categorical judgment is also given.
For these methods a relevant physical measure, if present, is
used only as a tag; such a nominal scale is necessary for the methods
of paired comparisons and successive categories.

There is an interesting extension of the comparative judgment
model to incorporate the “response surface” model, as developed
by George E. P. Box and his co-workers and others, and also to
include a factorial model. The factorial model is illustrated by
an interesting example involving the relative merits of two levels
of each of four factors: merit incentive, weekly wage (vs. hourly),
piecework incentive, and pay class.

The last chapter presents the application of these models to
prediction of first choices (from sets larger than two), prediction
of choice for compound objects, and prediction of lunch purchases
from successive category ratings.

All of the methods presented are based on Thurstone’s
concept of a discriminal process characterized by a discriminal
dispersion and a central tendency. For each of the methods
presented, the particular variant of this general model is accurately
specified, and the appropriate experimental procedures for
collecting data to be processed by this model are specified in detail.
The analysis of variance procedure appropriate for each method
is presented in general terms and is also illustrated with data from
an experiment.

Numerous concrete illustrations of experiments in psychological
measurement are given, such as measuring differential thresholds
for concentrations of salt or sugar, obtaining absolute thresholds
for sodium bicarbonate or sodium benzoate, the flavor of pork as
influenced by the proportion of corn and peanuts in the ration,
and preferences for canned beans or for cottage cheese as influenced
by the percentage of salt and sugar. Other experiments analyzed
dealt with preferences for various cereals, for various gifts, for
various lottery tickets that might win specified gifts; preferences
for various labor-management provisions with respect to salary,
vacation, and retirement income; and differences in the feel of
wool as a function of the detergent in which it was washed.

The final chapter gives a series of interesting applications of
scaling methods to the “prediction of choice.” The methods
presented are illustrated with experiments on prediction of first choice
among sets of food, or with purchases of lunch at a club as
predicted from scaling experiments, and with preferences for lottery
tickets for different gifts.

For each method where appropriate, the significance tests and
the confidence intervals are presented. There is no emphasis on
the fact that as the number of cases increases, there is a tendency
toward finding increasingly “significant” differences between the
data and a hypothesis, and, correspondingly, a decrease in the
width of confidence intervals. Possibly this is because it is assumed
that readers have achieved such a level of statistical sophistication
that it need not be mentioned. The possibilities opened up by a
components of variance analysis giving an estimate of the relative
magnitudes of the variances that are, or are not, significantly
different are not stressed (Gulliksen and Tukey, 1958).

There is no reference to the possibility of regarding each judge 
as a vector in some multidimensional space, and each stimulus 
as a vector in the same space, and obtaining the scale value of each 
object for each judge as the scalar product of the two vectors 
(Tucker, 1964). 

Similarly, there is no presentation of the theory developed by 
Coombs (1964) and others which regards each person as a point 
in a uni- or multidimensional space, and each object as a point in 
the same space, the preferences being given not by a product of 
vectors, but by the distance from the person to the object. Prob- 
lems and methods of multidimensional scaling are also not in- 
cluded in the scope of this book. 
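
The contrast between the two omitted approaches can be made concrete with a toy sketch. It is purely illustrative: the coordinates and function names are hypothetical, and the code shows only how a preference is read off once coordinates are known. It does not show the estimation procedures of Tucker (1964) or Coombs (1964).

```python
import math


def vector_model_value(judge, stimulus):
    """Scale value as the scalar product of judge and stimulus vectors."""
    return sum(j * s for j, s in zip(judge, stimulus))


def distance_model_order(person, objects):
    """Objects ordered by distance from the person's point, nearest first."""
    return sorted(objects, key=lambda name: math.dist(person, objects[name]))


# Hypothetical two-dimensional coordinates.
person = (1.0, 0.0)
objects = {"A": (3.0, 0.0), "B": (1.0, 0.5)}

value_a = vector_model_value(person, objects["A"])  # 3.0
value_b = vector_model_value(person, objects["B"])  # 1.0
order = distance_model_order(person, objects)       # B is nearer than A
```

In this example the scalar-product reading favors object A while the distance reading favors object B, which is why the two models are not interchangeable.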

In summary, we have here an excellent theoretical and statistical 
treatment of the major methods of unidimensional scaling, where 
the investigator has predefined the population whose judgments 
or preferences are of interest. The treatment is limited to the three 
important experimental methods (the constant method, paired com- 
parisons, and successive categories). The appropriate theory (the 
law of comparative judgment and the law of categorical judgment) 
is also stated with appropriate modifications for different variations 
in the experimental procedures. Also the research worker who
does not have an interest in the theoretical development will find
a clear statement and illustration of the experimental and computing
procedures with numerous tables to expedite the data analysis.

For some time to come, this book will be valuable to those
interested in the theory and the methods of unidimensional scaling.

Albert B. Hood. “What Type of College for What Type of
Student?” Minnesota Studies in Student Personnel Work. No. 14.
Minneapolis: University of Minnesota Press, 1968. Pp. vi + 84.
$2.75 (paper back).


There can be no doubt that American higher education is typified
by great diversity, both in terms of types of colleges and types
of students applying to them. From a variety of standpoints it
seems reasonable to assume that some types of colleges must be
better than others for certain students. With this assumption, Hood
raises the question, “What types of colleges for what types of 
students?” 

The participants in this study included 6,959 males and 5,446 
females who graduated from Minnesota high schools (public or 
private) in 1961 and who completed at least one semester or quar- 
ter in an accredited four-year college or a public junior college in 
Minnesota during the academic year 1961-62. 

During their senior year in high school, the participants com- 
pleted a questionnaire developed by Berdie (1954) entitled “After 
High School-What?” Thus, information was obtained regarding 
the family, economic, cultural, and social backgrounds of the
participants. Berdie's questionnaire was modified to include 29
personality items, 25 of which were the highest validity items from
the social relations and conformity scales of the Minnesota Coun- 
seling Inventory (Berdie and Layton, 1953). In addition to these 
measures, high school percentile ranks (HSR) and scores on the 
Minnesota Scholastic Aptitude Test (MSAT) were available on 
all participants. 

At the end of the academic year 1961-62, freshman average 
grades were obtained on all participants and served as the criterion 
measure in this study.

The institutions participating in this study included the eight
private liberal arts colleges in Minnesota, the three Catholic men's
colleges (liberal arts), the four Catholic women's colleges (liberal
arts), the five public state colleges, the ten public junior colleges,
the one private junior college, and seven colleges of the University
of Minnesota.

For each institution, frequency distributions, means, and
percentages (when appropriate) were computed for each of the
different ability, socioeconomic, and personality variables. After the
differences among types of institutions in these variables had been
noted, the extent to which these variables were related to
achievement in the various types of institutions was found. Finally,
multiple regression analysis was used to determine which of the
variables were related to achievement in the different types of
institutions after ability and previous achievement record had been
taken into account. This analysis also yielded the maximum variance
in the prediction of freshman achievement when all predictors— 
cognitive and noncognitive—were taken into account. 

These analyses were focused on seven specific areas: academic 
achievement levels in different types of colleges, socioeconomic 
factors and college choice, academic achievement of farm stu- 
dents, personality characteristics of students in different types of 
colleges, academic achievement of rebels in different types of 
colleges, and academic achievement of introverts in different 
types of colleges. 

Among the most interesting aspects of the study in the re- 
viewer’s opinion was the attempt to calculate a “difficulty index” 
for comparing grades at one college with grades at another college. 
This index was computed from the means and standard deviations 
of ability measures obtained during high school. 

It would be interesting to apply Hood’s procedure for computing 
the “difficulty index” using measures of achievement and ability ob- 
tained during college rather than high school. Furthermore, Hood’s 
results should be cross-validated before definite claims are made. 

The study presents a variety of other interesting findings and 
conclusions concerning the seven issues outlined above. The majority
of the findings have been reported in previous monographs
and articles by other authors, but there were some surprises con- 
tained in the study. For example, Hood hypothesized that uni- 
versity students would tend to be more rebellious and less con- 
forming than students in smaller church-related colleges, but the 
data did not bear this out. In fact, students entering the Catholic 
men’s colleges appeared more rebellious and less responsible than 
men from other liberal arts colleges. 

It is interesting to note that the majority of “unexpected” findings
emerged from the consideration of personality characteristics
of students. The surprises came primarily when personality “types”
in different kinds of colleges were investigated. There were few
unexpected findings when the relationships between personality
variables and academic achievement were investigated.

Fishman (1962), in commenting on the literature produced by
prediction studies of 1948 to 1957, said that 


The most usual predictors are high school grades and scores 
on a standardized measure of scholastic aptitude. The usual 
criterion is the freshman average. The average multiple correlation
obtained when aiming the usual predictors at the usual 
criterion is approximately .55. The gain in multiple correlation 
upon adding a personality test score to one or both of the usual 
predictors, holding the criterion constant, is usually less than
+.05. (Fishman, 1962, p. 669). 

While Hood's average correlations and gains do not exactly
substantiate Fishman’s figures, his results are close enough to suggest
that new methods deserve trial in investigations of the relation- 
ships between personality variables and academic achievement. 

Another criticism which might be leveled by some is that academic
achievement was the sole criterion considered in the study.
Hood is careful to point this out, but goes on to recognize that
“it (academic achievement) is critical both for remaining in col- 
lege and for gaining admission to graduate and professional schools” 
(p. 4). In this day when “relevance” is the cry, one wonders 
whether such practicality is sufficient. It is obvious that things 
other than academic achievement are important; it is equally 
obvious that the criterion problem is a major one. 

In an important concluding chapter entitled “Implications for 
Educators,” Hood points out that while “this study showed that 
certain aspects of personality were related to achievement in 
college . . . in general, the relationships were similar in all the 
different colleges. If certain personality characteristics of a stu- 
dent will either help or hinder his academic achievement, they 
are likely to do so at whatever college he attends.” (p. 79). This 
conclusion is entirely justified by the data presented in the study, 
but it is important to note that this study typed colleges with 
reference to such variables as source of control, curriculum, sex 
of student body, etc.

While these variables are no doubt important in determining the
environmental characteristics of colleges, recent efforts directed
toward identifying and assessing such variables have provided
researchers and decision makers with valuable new tools. These
efforts might be characterized as sociological approaches to college
environments in that they purport to measure “press” or
socio-psychological forces operating within the campus community. It
is entirely conceivable that two state colleges might be characterized
by different socio-psychological forces and therefore affect
students in different ways. Thus in determining the impact of
college environments on student behavior, it would seem important
to examine the interaction in terms of personality variables
of students and socio-psychological variables of the campus rather
than typing colleges only according to source of control, curriculum,
sex of student body, etc. For a more detailed discussion of
this issue, the reader is referred to Fishman's chapter in The American
College (1962).

In conclusion, it may be said that What Type of College for
What Type of Student? hardly answers the question it raises. But
the question is an important one, and Hood has made a commendable
attempt to address it. The monograph should be of interest
to researchers, counselors, and administrators alike.


John E. Horrocks and Thelma I. Schoonover. Measurement for
Teachers. Columbus: Charles E. Merrill, 1968. Pp. v + 645.
$8.95.


In 1964, Assessment of Behavior, a quality publication, joined the
group of textbooks available for courses in psychological measurement,
courses combining psychological and educational measurement,
and courses designed for guidance counselors and school
psychologists. Now, the author, John Horrocks, has joined with
Thelma Schoonover in the preparation of a new book in the area
of educational measurement, one based heavily on the first. It is
designed for “a standard course in educational measurement or
evaluation, or for a course in measurement for guidance counselors
as well as a reference for the practitioner.”

The heritage of the second publication is obvious. Nine of the
twenty-four chapters are entirely new. Some of the remaining
fifteen chapters have been rewritten; others are the same as corresponding
chapters in the 1964 volume, with the exception of minor
changes. The authors forthrightly state that whenever a discussion,
graph, or table in the original book was pertinent to the purpose of
the new volume, it was retained on the assumption that there is no
merit in making changes purely for their own sake. Hence, those
readers acquainted with the 1964 volume will find the content of
the 1968 publication to be familiar territory, and no doubt will
move through it with ease.

Changing a book written for a specific audience so that it is
suitable for another is a task which is very easily underestimated.
The freshness of approach and imaginativeness of presentation
that buttress the quality of the first effort may prove to be
a dead weight which an author carries as he addresses himself
to a new audience on essentially the same topic. One wonders
whether this may have happened here.

True, the authors have greatly curtailed the discussion of some
topics in the 1964 publication. Examples are intelligence testing 
and the measurement of personality characteristics. Furthermore, 
they have expanded other sections when preparing the 1968 publication.
An excellent example of this is achievement testing. But did
this occur to a sufficient degree? Not in the minds of some critics. 

By taking modest liberties with the intent of certain chapters, 
it is possible to consider the twenty-four chapters as grouped into 
six subdivisions. For example, the first four chapters serve as an 
introduction to educational and psychological measurement, and 
constitute one subdivision. By and large, these are well done.
Particular mention should be made of the chapter on individual
differences and its relationship to measurement, a topic not normally
covered in books of this type. On the other hand, one disappointment
in the treatment of validity and reliability is the failure of the
authors to use the 1966 publication by French and Michael regarding
standards for educational and psychological tests rather than
the 1954 statement mentioned in Chapter 4. The revamped terminology
regarding validity, for example, would be helpful.

A rather traditional treatment of achievement testing is given in
Chapters 5, 21, and 22, and can be thought of as the second subdivision.
This discussion is not as extensive as often found, and
might be weakened by the fact that the chapters are separated.

Seven chapters (namely, Chapters 6 through 12) are devoted to
readiness testing and standardized achievement testing of a survey
type. These chapters are strong in many ways. The elementary
school teacher, in particular, will be pleased by the attention given
reading, mathematics, and the language arts. This subdivision of
the publication should be considered one of its strengths, and will
provide excellent tools for instructors who believe in an
emphasis on testing for readiness, and on the use of standardized
achievement tests in the basic content areas such as the social studies.

The fourth discernible subdivision concerns the testing of student
aptitudes, principally general mental ability. In Chapters 13
through 16, the reader will find well-developed presentations of the
nature of intelligence and its measurement by both individual and
group tests. The last chapter of the group is devoted to the
measurement of other aptitudes by means of various tests, including
differential aptitude batteries.

One of the truly difficult topics to handle in a book of this kind
is the measurement of personality traits. The authors have prepared
four chapters in this subdivision, thereby covering the major
points regarding the nature of personality as well as the measurement
of commonly mentioned aspects of it such as interests.
The chapter devoted to rating scales has promise.

The last of the six subdivisions is a traditional one, namely the
discussion of testing programs and the use of the results of these 
programs with school personnel and members of the community. 
The two chapters in this subdivision are comparatively short but 
adequate for the book when its global purposes are considered. 

As one studies the publication by Horrocks and Schoonover,
he probably will be impressed with a number of features which
recur with some regularity. For instance, the bibliographies at
the end of each chapter are longer than one might expect, sometimes
exceeding 100 entries. To be sure, these are comprehensive.
Perhaps a more effective approach would be to shorten the bibliog- 
raphies, thereby becoming more selective, and adding annotations. 

Another characteristic of the book is its comparatively light 
emphasis on statistical methodology as applied to measurement. 
The authors state that “a course in measurement is not a course in 
statistics.” This point of view has been consistently followed in 
the book even though minor slips occur; for example, the somewhat 
lengthy discussion of the normal curve (see pages 42-43). 

Still a third characteristic of note is the repeated attempt by the 
authors to present up-to-date, useful information about available 
standardized tests. Rather long tables are included in a number of 
chapters in which the names of such tests, the names of the authors, 
and the publishing companies, as well as a brief statement of the 
features of the tests are given. Such information is often reported 
in the appendices of an educational measurement publication. No 
doubt, the approach used by Horrocks and Schoonover is an im- 
provement over that ordinarily found. 

In summary, it is clear that the authors have presented psycho- 
logical testing as applied in education in quite a systematic manner. 
In regard to standardized tests in achievement, aptitude, and personality,
the treatment has been very comprehensive, probably
too comprehensive for a one-term course in educational measurement.

The comparatively heavy emphasis on standardized testing as
compared to teacher-constructed tests will be viewed negatively
by many instructors of courses in this field. Students and teachers
alike often wish to begin their study with consideration of the
principles and guidelines of teacher-constructed achievement tests.
Experience shows that many students gain a deeper understanding
and appreciation of all of psychological testing by having been
exposed almost immediately to the joys and hazards inherent in
the development of achievement tests in subject-matter areas very
familiar to them.

Instructors using this book may want to supplement it with
any one of several available publications in basic statistical
methodology as applied in psychological testing. This supplementation
added to a thorough educational experience based on the
book by Horrocks and Schoonover will indeed equip the fledgling 
classroom teacher with a working knowledge of educational meas- 
urement, probably one noticeably superior to that typically found 
among the newly minted classroom teachers today. 


J. STANLEY AHMANN 
Colorado State University 


Samuel Levine and Freeman F. Elzey. A Programmed Introduction 
to Research. Belmont, California: Wadsworth, 1968. Pp. ix + 
236. $3.95 (paperback). 


Levine and Elzey have produced a book in linear programmed 
format that should be useful to the lowest quintile of seniors and 
first year graduate students who are required to complete a course 
in research. Basically this material is an introduction at a simple 
level to the introductory vocabulary of research. 

The book is divided into eighteen sets or chapters. The first 
deals with scientific verification, prediction and explanation. In 
general this material teaches the student to discriminate between 
scientific and nonscientific investigations. When the book is revised 
it would be desirable for the authors to sharpen discriminations;
e.g., they might consider illustrating faith as contrasted with scientific
enquiry within a single area such as the faith of a patient in
a prescribed drug and the scientific assurance of the pharmaceutical
researcher.

The first set also deals with prediction and explanation. Levine 
and Elzey over-simplify their illustrations. For instance, in frame 
44 they say, “An uncorrected visual defect is a plausible explanation 
for not having learned to read.” Actually some uncorrected visual
defects, particularly for distortion of far vision, are associated
with comparative improvement in reading as children compensate 
for weakness in sports. The book would be stronger if these over- 
simplifications were eliminated. 

The second set deals with inductive and deductive vocabulary. 
Levine and Elzey make a major point out of the desirability of
having a number of cases before generalizing. There is in the book
an implication that a sufficient number of observations will lead
automatically to a correct generalization. There is an implication
that after major and minor premises are set up they are accepted
and predictions deduced are tested without attention to the need
for testing the premises themselves as possible sources of error.

In the third chapter on concepts, indicators, and instances the
authors present concepts as real things with no attention to the
process of conceptualization as a likely source of error. The use of
IQ as a stable verity that can be taken at face value probably does
disservice to the students who are most likely to find the book
useful. The bald statement that the concept of memory is less 
abstract than the concept of intelligence raises questions concern- 
ing the authors' understanding of each of these ideas. The state- 
ments, “Because standardized intelligence tests exist, instances of 
varying levels of intelligence would not be difficult to obtain,” 
followed by: “On the other hand, the objective verification of 
instances of varying degrees of anxiety would present a problem,” 
lead to speculation about polygraphs which indicate anxiety and 
are accepted as evidence in courts of law while IQ scores are not 
accepted with comparable reverence. 

The chapters on dependent and independent variables (Chapter
4), hypotheses (Chapter 5), sampling (Chapter 6), stratified sampling
(Chapter 7), control by randomizing (Chapter 8), control by
homogeneity (Chapter 9), and control by matching (Chapter 10) all give
vocabulary building exercises relative to the topics listed.

Part 5, comprising Chapters 11, 12, and 13, includes the terminology
most obviously necessary to discuss the use of statistics in research.
The balance of the book gives practice in the use of vocabulary
germane to research design.

If the book is used as an introduction to terminology by stu- 
dents who need help over and beyond that provided by a standard 
text, A Programmed Introduction to Research can become a use- 
ful adjunct to the instructor and save him time presenting vocab- 
ulary he thinks should have been obvious. It would be unfortunate 
if a student were limited in his introduction to research only to 
the material in this text. 


JOHN A. R. WILSON
Professor of Education
University of California 
Santa Barbara 


William E. Mendenhall. Introduction to Probability and Statistics. 
(2nd Ed.) Belmont: Wadsworth, 1967. Pp. xiii + 393. $11.95
and $8.95 (text). 


The latest edition of Mendenhall’s elementary statistics text is
excellent. It is probably typical of reviews such as this that comments
are made on the clarity of the text under discussion. However,
it seems necessary to note that Professor Mendenhall’s writing
is considerably above the average in lucidity. The writing
is spare and lean, there is no verbal fat in the book, and the topics
are covered succinctly and adequately. Concerning the level of
preparation, the sophistication required of readers of the text is not
very high. Some reasonable training in elementary algebra would
seem to be all that is required.

The coverage of the book is classical: It includes introductory
chapters which define the role of statistics in making inferences 
from data, and which review some of the elements of elementary 
algebra. The remainder of the book includes the tabular presenta- 
tion and arithmetic summary of bodies of data, elementary prob- 
ability, random variables and concomitant distributional notions, 
the binomial and normal distributions and large sample interval 
and point estimation within those distributions. Small sample theory 
introduces the t and χ² distributions, which are followed by discussion
of regression and correlation, use of the χ² distribution
with frequency data, the analysis of variance, and nonparametric 
statistics. 

Only a few things need be said about the book, since it is a solid 
and accurate textbook, which neither breaks new ground nor makes 
important errors. The one point which is made early and emphati- 
cally in the book is that the whole intent of statistics is inferential. 
This point is exemplified in a number of contexts, and reference 
is constantly made to collections of units or measurements larger 
than the sample. The notion of a probability distribution is also 
introduced in this context. 

In the chapter on the analysis of frequency data, one important 
but frequently ignored point is made. That is the difference in
logic between the case when the marginal frequencies in one di- 
mension of a 2-way contingency table are fixed and the case of 
both dimensions being free to vary. It is unfortunate that the 
similarity of this distinction with the distinction between the 
statistical theory for correlation and regression coefficient esti- 
mates in continuous data is not mentioned. 

A final point to illustrate the coverage and succinctness of the
book. The chapter on the analysis of variance runs for twenty-one
pages. In that space, the basic notions of analysis of variance, including
estimation and hypothesis testing in the completely randomized
and the randomized block designs, are presented. In Hays'
Statistics for Psychologists, the section on fixed effect analysis of
variance, short of the randomized block design, is over thirty pages,
and the pages are larger! Admittedly, the main drawback of the
book, which presumably was intentional, is the absence of model
specification. In view of Mendenhall’s excellent book on the linear
model it seems unfortunate that some of this emphasis does not
appear in this book.

The book may be purchased with an accompanying programmed
study guide which these reviewers did not attempt to assess. 


DAVID E. WILEY
WILLIAM H. SCHMIDT
The University of Chicago


William Mendenhall. Introduction to Linear Models and the Design
and Analysis of Experiments. Belmont: Wadsworth, 1968.
Pp. xvi + 465. $14.60 and $10.95 (text). 


The emphasis of Introduction to Linear Models and The Design 
and Analysis of Experiments is on the practical use of linear models 
in the design and analysis of experiments. The book would serve as 
a good textbook for an introductory course on experimental de- 
sign for students in the behavioral sciences. The approach is prac- 
tical and intended for students with a wide range of mathematical 
ability but presents a relatively rigorous and logical development 
of the theory. 

The comment relevant to the applicability of this book to an
introductory course in design needs further clarification. The 
traditional approach to design in the behavioral sciences has been to 
introduce the subject in the context of the analysis of variance. 
The usual practice is to state the randomization procedure and the 
resolution of the sums of squares appropriate to the design. An 
ANOVA table is then presented which summarizes this resolu- 
tion and indicates the methods of testing important hypotheses. 
One often finds the student memorizing formulas without understanding
the nature of the theory associated with the underlying
model.

Mendenhall’s notions of model building consist of: specification
of the model and design, estimation of the parameters of the model,
parameter hypothesis testing, and the determination of the lack
of fit. These are presented in a logical fashion leading the student
through some of the most important notions of applied statistics.
Fixed models which are linear in the parameters and the corresponding
least squares theory receive most attention in the book.
This approach to design and analysis gives the student a framework
which is useful in many experimental situations.

One way to assess Mendenhall’s book is to compare it with
several other texts on experimental design. Winer (Statistical
Principles in Experimental Design) covers the mixed model, including
those extensions of the split plot design useful in experimental
psychology. Mendenhall does not deal with this topic to
any great extent.

Peng (The Design and Analysis of Scientific Experiments) is
most like Mendenhall. The topics are similar, and the style is
comparable. Mendenhall covers details of least squares theory
not found in Peng as well as other topics.

Cochran and Cox (Experimental Design) approach experimental
design within the framework of the analysis of variance. Mendenhall
covers most of the major topics (including, for example,
fractional factorial arrangements); however, his development is
not as complete.

Mendenhall has a different approach than the other texts. The
specification, estimation and testing of linear models as a basic 
approach to the design of experiments is relatively new as a basis 
for textbooks. Mendenhall’s book is an integrated and unified ap- 
proach appropriate for an introductory course in design. 

The book emphasizes the practical use of models in experimental
design and contains numerous examples. This is one of the strong
points of the book. Also worthwhile are the numerous exercises,
most with answers. An appendix which summarizes several common
statistical tests and interval estimation procedures is included.

Although numerous examples are included, the source of 95%
is the natural sciences and related areas. Few examples are drawn
from psychology and education. Readers of this journal will be
attracted, however, by the clarity of the examples and their relevance
to the development of important statistical concepts.

The book covers a wide range of topics in experimental design. 
These include: linear models, least squares theory, classical designs 
(blocks, Latin squares), the analysis of variance, response surface 
designs, fractional factorial and incomplete block designs, mixed 
model and the random model. The discussion of the various designs,
models and other topics is well illustrated and precisely and
clearly written. This is certainly one of the outstanding features
of the book. The delineation of the theory of factorial designs
and the development of the concept of confounding are two examples
which serve to illustrate this point.

The chapters on linear models and least squares theory are very
good. The transition within these chapters from traditional notation
to matrix algebra is excellent. The random model is developed
within the context of nested designs and seems too limited.

An outstanding feature of the book is the integrated development
of linear models for both qualitative and quantitative variables.
The traditional fixed effect analysis of variance model is then
a special case of the general linear model.
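
The special-case relationship can be illustrated with a small numerical sketch. It is not taken from Mendenhall's text: the function names, the indicator coding (an intercept plus one indicator per group after the first), and the data are assumptions chosen for illustration. The sketch fits a one-way layout by least squares and compares the residual sum of squares with the within-group sum of squares of the classical analysis of variance.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_one_way(groups):
    """Least-squares fit of y = b0 + b1*[group 1] + ... + b(k-1)*[group k-1].

    Group 0 is the reference group (an illustrative coding choice).
    Returns the coefficient list and the residual sum of squares.
    """
    k = len(groups)
    rows, y = [], []
    for g, values in enumerate(groups):
        for v in values:
            rows.append([1.0] + [1.0 if g == j else 0.0 for j in range(1, k)])
            y.append(float(v))
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    resid_ss = sum(
        (yi - sum(b * x for b, x in zip(beta, r))) ** 2 for r, yi in zip(rows, y)
    )
    return beta, resid_ss


def within_ss(groups):
    """Classical within-group (error) sum of squares from the ANOVA table."""
    total = 0.0
    for values in groups:
        mean = sum(values) / len(values)
        total += sum((v - mean) ** 2 for v in values)
    return total


# Made-up data: three treatment groups with means 2, 5, and 9.
groups = [[1, 2, 3], [4, 5, 6], [7, 8, 12]]
beta, resid_ss = fit_one_way(groups)
print(beta)                      # intercept = mean of group 0; others = differences from it
print(resid_ss, within_ss(groups))
```

Under this coding the intercept estimates the mean of the reference group and each remaining coefficient estimates a difference from it, so the residual sum of squares of the linear model coincides with the error sum of squares of the ANOVA table.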

In summary, Mendenhall has produced an excellent textbook
integrating model construction and design in the context of
experimentation.

WILLIAM H. SCHMIDT
DAVID E. WILEY
University of Chicago 


David M. Messick (Ed.). Mathematical Thinking in Behavioral
Sciences. San Francisco: W. H. Freeman, 1968. Pp. viii + 
$10.00 and $4.95 (paperback). 


This well-selected collection of 27 articles that originally appeared
in the Scientific American during the years 1948 through December
of 1966 is psychologist Messick's paean to the computer
revolution, with interesting asides to its mathematical allies. The 
volume is exceptionally large, sturdily bound, and attractive. It 
includes many illustrations, some in full color. The articles are as 
they originally appeared, except for occasional minor editing. 
(One may be a bit startled to discover on page 47, in a 1949 
article, reference to “the late Norbert Wiener,” especially when on
page 82 in a September 1964 article he is referred to as “the eminent
mathematician who died last winter.”)

There is a helpful introduction by Messick to each of the five 
parts: “The Analysis of Uncertainty: Probability” (5 articles),
“Communication and Control” (6 articles), “Games and Decisions”
(4 articles), “Imitations of Life” (6 articles), and “Recent Computer
Applications” (6 articles). Also, the two-page index is useful,
especially for the names of persons cited, though of course not 
really adequate for reference purposes. 

Among the authors are a number of the “greats,” such as Wiener 
on “Cybernetics” (1948), Morgenstern on “The Theory of Games”
(1949), Weaver on “The Mathematics of Communication” (1949), 
and Suppes on “The Uses of Computers in Education” (1966). 
Also liberally represented are brilliant younger mathematically- 
oriented innovators. 

This is rich, varied fare, to which the reviewer can do little
justice descriptively, much less critically. It is difficult to imagine 
readers of EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT who 
would not find the book well worth its price, even if they have 
been keeping up with the Scientific American faithfully during 
the years since World War II, or any graduate student in the 
behavioral sciences who would not be stimulated by its contents. 

JULIAN C. STANLEY

The Johns Hopkins University 


Phillip J. Rulon, David V. Tiedeman, Maurice M. Tatsuoka, and 
Charles R. Langmuir. Multivariate Statistics for Personnel Clas- 
sification. New York: 1967. Pp. xi + 406. $12.95. 


This book is directed toward the personnel psychologist, occupational
counselor, and clinical psychologist. It is probably the most
simple presentation of multivariate analysis ever written. The
authors strive to give the reader an intuitive grasp of multivariate
techniques through the abundant use of graphs, numerical examples,
and verbal discussion. The first equation appears on page 40. There
are at least 113 figures and 52 tables. Very little mathematical preparation
is assumed. Enough matrix algebra is introduced in the text
so that the reader will learn how to compute a quadratic form.
The book should be understandable to anyone who has mastered
a first course in psychological statistics. 

The first half of the book is taken up with showing how to 
represent the relationship of an individual profile to profiles of an 
occupational group. The authors demonstrate that ordinary group- 
average profiles are inadequate for this purpose and introduce the 
concept of a “centour” score for an individual with respect to a 
group. (A centour score is defined as the percentage of a group 
falling outside the equidensity ellipsoid on which lies the point 
representing the individual’s profile.) Before discussing the general 
case, the book has several chapters presenting in painstaking detail 
the one-variate case, the two-variate case, and the three-variate 
case. The discussion of each case is divided into sections describing 
personnel classification for one group, two groups, three groups,
and an arbitrary number of groups. Many readers will want to skim
over the first four chapters and start with the general case in
Chapter 5.
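
The centour definition above can be made concrete for the two-variate case. The following is a minimal sketch, not the authors' computing procedure: it assumes the group's scores are bivariate normal, in which case the proportion of the group lying outside the equidensity ellipse through a profile is exp(-q/2), where q is the quadratic form of the profile's deviation from the group centroid. The function name and the numbers are hypothetical.

```python
import math


def centour_2d(profile, mean, cov):
    """Centour score (0 to 100) of a two-score profile relative to one group.

    Assumes bivariate normality of the group's scores. `cov` is the 2x2
    covariance matrix of the group, given as ((a, b), (c, d)).
    """
    dx = profile[0] - mean[0]
    dy = profile[1] - mean[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    # Quadratic form (x - m)' S^-1 (x - m), using the closed-form 2x2 inverse.
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    # For two variates, P(chi-square with 2 df > q) = exp(-q / 2).
    return 100.0 * math.exp(-q / 2.0)


# Hypothetical group centered at (50, 50) with standard deviations of 10
# and a correlation of .5 between the two tests.
group_mean = (50.0, 50.0)
group_cov = ((100.0, 50.0), (50.0, 100.0))
print(centour_2d((50.0, 50.0), group_mean, group_cov))  # profile at the centroid: 100.0
print(centour_2d((65.0, 40.0), group_mean, group_cov))  # a more atypical profile: lower score
```

A profile at the group centroid receives a centour of 100, and the score falls toward zero as the profile moves to ellipses farther from the centroid. With more than two variates the same quadratic form is referred to a chi-square distribution with as many degrees of freedom as there are tests.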

Chapter 6 describes how to compute the relative probabilities 
of membership of an individual in several groups, and it depicts 
the geometry of the regions in the test space which minimize 
classification error. Chapter 7 presents factor analysis in order to 
introduce the concept of reduced dimensionality. Chapter 8 on 
discriminant analysis then follows as the logical answer to the 
problem of reducing dimensionality while preserving discrimination
between groups.

Chapter 9 is an excellent critique of classification with regression 
analysis but unfortunately contains a very inaccurate description 
of the findings reported by Dunn (1955) in an unpublished doctoral 
dissertation. The next chapter introduces the idea of probability 
of success and makes a rather unconvincing case for Tatsuoka's 
method of combining probability of membership and probability 
of success into a single index for classification. The last chapter 
is an illuminating summary discussion of personnel classification 
problems and might well be read first. Two appendixes give data 
for the examples used in the text.

The weaknesses of the book are primarily those of omission:
There is no glossary, no index, no list of figures or tables, and a
very inadequate bibliography. The lack of an index or glossary
makes it very difficult to dip into the book and read selected passages.
The skimpy list of references impairs the book in its role as
a gateway to further knowledge.

The most striking omission in content is the total absence of
any reference to economic theory. Personnel classification is discussed
as if it had no relation to supply and demand, salaries or
profits, competition or social utility. Statistical decision theory
is ignored and linear programming is never mentioned.

In this reviewer's opinion, no book on personnel classification
could be considered adequate without a chapter describing the 
Transportation Problem, which is the operations research tech- 
nique of assigning men to jobs so as to meet quotas and to maximize 
the sum of any given measures of the suitabilities of the men in 
their jobs. Today the Army, Navy, and Marine Corps are using 
the Ford-Fulkerson algorithm for the Transportation Problem to 
assign hundreds of thousands of men to specialized training among 
hundreds of schools. The book omits reference to or discussion of
any modern work in this area and mentions only the early efforts 
of Rao, Brogden, and Votaw. 

The neglect of the Transportation Problem leads to a confound- 
ing of the problem of how to measure man-job suitability with 
the problem of how to assign men so as to maximize the sum of their 
suitabilities for their assigned jobs. Most of the discussion seems to 
assume that every man should be assigned to that job for which his 
suitability measure is highest. Such a procedure will, in general, 
attempt to assign more men than there are vacancies in certain jobs 
and to assign too few men to other jobs. The proper assignment 
method involves placing quotas on the number of men to be 
assigned to each job. The authors do not seem to realize the 
importance of this point. For example, in Chapter 10 a comparison 
is made among three different suitability measures without keeping 
quotas for the occupational areas constant. 
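
The distinction drawn here can be shown with a toy example. The sketch below is an illustration only: the suitability figures are invented, the function names are hypothetical, and the optimum is found by brute-force enumeration rather than by the Ford-Fulkerson algorithm mentioned above, which is what one would use at realistic scale. It contrasts sending each man to the job on which his suitability is highest with the quota-respecting assignment that maximizes the sum of suitabilities.

```python
from itertools import permutations


def best_job_counts(suit):
    """Number of men whose highest suitability falls on each job (quotas ignored)."""
    counts = [0] * len(suit[0])
    for row in suit:
        counts[row.index(max(row))] += 1
    return counts


def optimal_assignment(suit, quotas):
    """Assignment of men to jobs that fills every quota and maximizes total suitability.

    Brute force over all distinct ways of filling the quota slots; suitable
    only for very small illustrative problems.
    """
    slots = [job for job, quota in enumerate(quotas) for _ in range(quota)]
    if len(slots) != len(suit):
        raise ValueError("quotas must sum to the number of men")
    best_total, best_plan = None, None
    for plan in sorted(set(permutations(slots))):
        total = sum(suit[man][job] for man, job in enumerate(plan))
        if best_total is None or total > best_total:
            best_total, best_plan = total, plan
    return best_plan, best_total


# Invented suitability scores: rows are men, columns are jobs 0 and 1.
suit = [[9, 5], [8, 7], [7, 3], [6, 5]]
print(best_job_counts(suit))              # [4, 0]: all four men "belong" in job 0
print(optimal_assignment(suit, [2, 2]))   # ((0, 1, 0, 1), 28) with two vacancies per job
```

Every man in the example scores highest on job 0, so the "highest suitability" rule would send four men to a job with two vacancies. Under the quotas, the best total is reached by moving the two men who lose least from the transfer, which is a property of the whole group and cannot be read from any one man's scores.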

The fundamental thesis of this book is that group membership, 
not success in a group, is the primary datum in classification. After 
studying the text, this reviewer remains unconvinced of the valid- 
ity of this principle. This approach seems to imply that occupa- 
tional classification should be based on conformity rather than 
achievement. An Einstein would be judged less suitable as a physicist
than a person whose test scores were close to those of the average
run-of-the-mill physicist. The text argues that group membership is
the appropriate measure of suitability because people who are
“over-qualified” for a profession should seek another, e.g., Einstein
would be more over-qualified as a patent clerk than as a
physicist. 

On the contrary, this reviewer would say that the so-called
“overqualified” person rejects an occupation not because of too
much success in it but rather because there are other, more rewarding
occupations for which he is also qualified, and for which few
other people can qualify. Probabilities of success, manpower shortages,
and potential rewards remain the primary determinants of
personnel classification. Only when measures of occupational success
are unavailable should group membership be resorted to as a
measure of suitability. 


Despite certain omissions and the controversial nature of its
material, the book is to be recommended to personnel and clinical
psychologists as a relatively simple presentation of many areas of 
multivariate statistics. Hopefully it may produce in those profes- 
sions a greater level of sophistication and a general upgrading of 
the theory and practice of personnel classification. 


John H. WOLFE
U. S. Naval Personnel 
Research Activity 
San Diego, California 


Gilbert Sax. Empirical Foundations of Educational Research. 
Englewood Cliffs: Prentice Hall, 1968. Pp. xiii + 443. $7.50.


Instructors of introductory and survey courses in research
methods have an impressive array of books from which to choose.
This book by Sax enlarges that subset which ought to be seriously
considered. The author uses the conduct of a research study as his
unifying theme and he follows this topic from the selection of a
problem for investigation to the preparation of the final report.

It is probably true that no book of manageable size which surveys
an area of content as broad as educational research can be
comprehensive enough to satisfy every interested person. Thus,
each text will be judged in terms of the agreement between the
topics selected for inclusion and those topics thought to be
important. A judgment about adequacy in this regard is, therefore,
largely a personal one and must be made by each potential user.

The orientation of the author is clearly one of empiricism and 
experimentalism. Problems of descriptive and correlational re- 
search are discussed only briefly and this discussion serves the 
purpose of giving students a nodding acquaintance with their pos- 
sible areas of application and their limitations. Historical research 
is omitted, not because the author feels that it is unimportant but 
because “. . . historiography is in itself a highly complex field
deserving to be studied on its own by those planning to specialize 
in the history or philosophy of education.” 

The first two chapters provide a setting and a rationale for the
utilization of the scientific method in education. They contain a
brief but excellent summary of the history of psychometrics and of
research in education. This summary serves to illustrate that
contemporary research in education, crude as it may be in some
respects, is remarkably sophisticated when compared with earlier
attempts. The historical summary is valuable in helping to place
current activities in proper perspective.

The next three chapters lead the student toward a completed
research study by offering guidance in the techniques of selecting
a research problem, reviewing the literature, and formulating a
testable hypothesis. These chapters provide a nearly optimal
combination of theoretical considerations, practical advice, and
references for further information.

The chapter devoted to sampling procedures stresses the im- 
portance of using appropriate methods to achieve representative 
samples. This important topic receives considerably more atten- 
tion in this book than it typically does in introductory texts. Per- 
haps the only weakness in an otherwise informative presentation 
is an unfortunate confounding of the issues of unbiased samples, 
unbiased statistics and consistent statistics. However, this is not a 
serious weakness since it is not crucial in developing the main 
theme of the chapter. 

Problems and procedures of observation and of the reliability 
and validity of measures are discussed at some length. Again, 
theoretical as well as practical issues are covered in some detail. 
A wide variety of observational techniques available to the re- 
searcher are described along with their strengths, weaknesses, and 
appropriate areas of application. This text is not a cookbook 
although it contains ample guidance in the practical aspects of the 
measurement of behavioral traits; it is not a theoretical treatise 
although it incorporates theoretical considerations into the discus- 
sion. Authors of introductory texts are caught in the dilemma of 
reluctance to assume more than a minimal level of statistical 
sophistication on the one hand and the inability to develop the 
topics of measurement and research design without it on the other 
hand. Sax’s solution is to present the appropriate formulas without 
reference to first principles. Some will find this approach adequate, 
some bewildering and some inappropriate. No one will find it en- 
tirely satisfactory. 

In the discussion of experimental design, the author draws 
heavily and profitably from the now-classic Campbell-Stanley 
chapter on experimental and quasi-experimental designs for re- 
search. Any attempt to present the fundamentals of design and 
the appropriate statistical analysis procedures in less than 50 pages 
must represent a serious compromise and can, at best, be only 
partially successful. The space restrictions imposed by the facts
of publishing life make the efficient use of available space that
much more important. It seems to this reviewer that the causes
of efficiency and pedagogy would have been better served by
deleting some topics not readily adaptable to educational research
(e.g., Latin squares and Greco-Latin squares) and others for which
there is no satisfactory statistical treatment (e.g., time series). On
the other hand, it may have been advantageous to expand the
discussion of experimental control and to elaborate on the more
educationally useful designs and techniques such as the analysis
of covariance and factorial and nested designs. An expansion may
have eliminated the lack of clarity in the presentation of repeated
measures designs and the analysis of covariance which may result 
in some readers concluding that these are alternative procedures 
to accomplish the same outcome. These critical references are not 
meant to detract unduly from the generally high quality of this 
brief presentation of a highly complex topic. 

One expects that a text on research methods will make extensive 
use of the results of previous research. This expectation is fulfilled 
in Sax's book. Readers will find an abundant use of research 
findings to illustrate, support and extend the major points of 
emphasis throughout the book. In addition, there are numerous 
suggestions for further reading on almost every topic discussed
so that those who are so inclined may pursue their interest with a 
minimum amount of time spent in nonproductive searching through 
the literature. Instructors will find the many practice exercises 
at the end of each chapter helpful in devising meaningful experi- 
ences for students. 

In summary, Sax has produced a book which is of general high 
quality and which provides instructors with another attractive 
option in choosing an introductory text in research methods. 


Percy D. PECKHAM 
University of Washington 


George W. Snedecor and William G. Cochran. Statistical Methods. 
(6th ed.) Ames, Iowa: The Iowa State University Press, 1967. 
Pp. xiv + 593. $8.50.


The most recent edition of Snedecor and Cochran is an important
book not only because of the huge impact earlier editions have had
on research workers and research, but also because of the extent
to which the earlier editions have influenced the available texts in
statistical methods. It needs hardly more than a cursory glance
at copyright dates, tables of contents, and organization within 
chapters to determine that Snedecor is the lineal ancestor of almost 
all popular texts in statistical methods. Because of this fact one may 
predict with some confidence that whatever is new or different in 
this edition of Snedecor will soon appear in other texts as well. 

The one area for which this assertion will probably not hold
true is in the different but not new introductory portion of the
book. A question that any methods book author must answer is how
to tie his material on techniques together. The obvious answer
(and the one that is being used to a progressively greater extent
and degree in psychology and education) is to include the
appropriate theoretical linking material. Unfortunately, this is
largely impractical from a pedagogical point of view for any very
comprehensive treatment of methods because of the high-level
mathematical nature of the theory and the lack of mathematical
sophistication of the students. For less complete texts it is highly appealing
because it allows easy access to methods via elementary (and easily 
demonstrable) probability theorems. Snedecor and Cochran cite 
one of their purposes as providing a text, but they also aim to 
provide a reference source on techniques for research workers. 
Hence they have chosen sampling experiments as an alternative 
mode of entering and binding together statistical techniques into 
a coherent whole. The sampling experiment approach has several 
advantages. It is highly meaningful intuitively and practically to 
both research workers and students. It can be discussed in detail 
with almost no initial fuss and confusion over non-essential 
definitions and terms. But, most basically, it is directly applicable 
to further reasoning about statistical tests and estimates with no 
intermediate mathematics so that this kind of attention can be 
focused on the arithmetic and algebraic requirements of the statis- 
tical techniques per se. 

Snedecor and Cochran’s first chapter, then, develops some very 
basic ideas of sampling and sampling distributions and descriptions 
of sampling distributions by means of sampling experiments. This 
leads straightforwardly to the idea of theoretical sampling distribu- 
tions and to the comparison of theoretical and empirical outcomes. 
Ideas of null hypotheses and tests of significance then develop quite 
naturally. Formal concepts of probability and probability
calculations are not introduced until Chapter 8, in which the binomial
distribution is developed both for its own inherent usefulness and as a
prelude to chi-square in Chapter 9.

The second and third chapters describe the normal distribution 
(also approached by empirical sampling experiment) and give 
extended consideration to measures of central tendency and
dispersion. Some comparisons of estimators in terms of efficiency
are made. The t-distribution also appears in the third chapter and
its characteristics are discussed in detail.

It is not until the fourth chapter, however, that tests and
estimates of differences begin to be discussed systematically. It is
also here that the first muddiness appears. The problem of pooling
or not pooling variance estimates is very difficult to wade through.
Students in the reviewer's methods course found this topic almost
impossible to comprehend from the text. On the other hand the
discussion of pairing observations, and the advantages and
disadvantages and difficulties of this procedure, is excellent.
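
The pooling question can be made concrete with a short sketch that computes a two-sample t statistic both ways. The data are hypothetical. The unpooled version uses the Welch-Satterthwaite degrees of freedom, which is one common approximation and is assumed here for illustration; it is not necessarily the procedure given in the text under review.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(x, y):
    """Two-sample t using a pooled variance estimate (assumes equal variances)."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / df
    t = (mean(x) - mean(y)) / sqrt(pooled_var * (1 / nx + 1 / ny))
    return t, df


def unpooled_t(x, y):
    """Two-sample t with separate variances and Welch-Satterthwaite df."""
    nx, ny = len(x), len(y)
    vx, vy = variance(x) / nx, variance(y) / ny
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df


# Hypothetical samples with visibly different spreads.
group_a = [12, 15, 11, 14, 13]
group_b = [20, 9, 25, 7, 19, 22]

print("pooled:   t = %.3f, df = %d" % pooled_t(group_a, group_b))
print("unpooled: t = %.3f, df = %.2f" % unpooled_t(group_a, group_b))
```

When the two samples have equal sizes and equal variances the two statistics coincide; the choice matters when the spreads or the sample sizes differ.
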

In Chapter 5 something relatively unusual in most methods texts
(although not new to Snedecor) is done. Nonparametric procedures
for testing differences are included and developed in the
mainstream of the book. Moreover, throughout the rest of the
book nonparametric techniques are discussed as the occasion arises
and within the framework of statistical methods generally. They
are not set off by themselves with the implication that they are 
poor relations and techniques of the last resort. 

The chapter on regression is excellent. The reasoning and com- 
putations flow so clearly from the model that the relatively com- 
plex estimation procedures that follow are made to seem unavoidable 
consequences of the model. Linear calibration and X-variables 
subject to error are also discussed. 

The discussion of product-moment correlation in Chapter 7 is 
clear, as are nonparametric correlation techniques, and the
binomial and chi-square procedures in Chapters 8 and 9. The section
on chi-square is quite good, but much more material on partitioning 
multiple degree of freedom chi-squares would have added greatly 
to its usefulness since such partitioning is one of the most rapidly 
increasing statistical procedures. The small increment in space 
would have been well justified. 

Snedecor and Cochran’s treatment of one- and two-way
classifications in anova is, as expected, excellent. The idea of
components of variance and their application is introduced early and
effectively. Hierarchical nested classifications are dealt with in a
similarly meaningful manner. There is a good, clear, and detailed
exposition of the effects of violations of assumptions in anova and
of transformations and other remedial measures. Perhaps the only
area of any weakness in these two chapters is the section on
multiple comparisons. Too much technique, explanation, results of
empirical studies, and alternatives are squeezed into too small a
space. The result is, predictably, lack of clarity and a moderate
degree of confusion. Linear comparisons and the general concept of
partitioning treatment sums of squares are handled in an
exceptionally clear and coherent manner. Throughout this section
structural models are used to good effect to show how parameters are
estimated, how error is accounted for and handled, and how tests of
significance are arrived at.

The chapter on factorial experiments and split-plot designs is
not as well done as the previous two. The material on factorials
(and particularly on partitioning interaction sums of squares) is
clear, as is the treatment of orthogonal polynomials and response
surfaces, but the segment on split-plot designs is cluttered,
compressed, and confusing unless one happens to be an agricultural
scientist and hence familiar with the specific example used. In
general, the diverse nature of the examples used by the authors does
not detract from the usefulness of the book. If anything, it enhances
it since it forces students to begin generalizing their new
knowledge to their own fields immediately. But in the split-plot
section (which is the prototype for the repeated measures designs
so popular in psychology and education) this is unfortunately not
the case.

In the last five chapters some of the most useful and
forward-looking material, and also some of the most dated techniques, are
presented. The treatment of multiple regression in Chapter 13 is 
superior. One might fault it slightly because it uses some matrix 
concepts but does not pay enough attention to matrix representa- 
tion and computation. However the often totally-neglected pro- 
cedures for selecting a best subset of predictor variables are dis- 
cussed, and multiple discriminant functions are introduced in an 
opportune way. The treatment of covariance analysis in Chapter 
14 has excellent breadth and includes consideration of multiple 
covariates. Chapter 15 provides a fine introduction to the rapidly- 
expanding field of nonlinear models. The most useful techniques 
are discussed in understandable terms and the reader is given a good 
working knowledge of procedures for handling non-linear relation- 
ships. The material in Chapter 16 dealing with analysis of two- 
way classifications with unequal numbers is much less useful than 
it once was. Today most data analysts do not bother with approxi- 
mate solutions, but instead use regression routines to get direct 
estimates of effects. The final chapter on sampling is outstanding.
It is comprehensible, pertinent, and well organized, and should be
required reading for all social and behavioral scientists.

On the whole, Snedecor and Cochran must still be judged a 
superior reference for research workers and a good text for a 
course or course sequence in statistical methods. Its coverage is 
broad and its treatment is practical and useful. It is a leader in the 
trend toward routine application of multivariate, nonparametric, 
and nonlinear techniques. It is not, does not pretend to be, and 
does not wish to be a theoretical book. Nevertheless it does, by 
means of empirical sampling experiments, develop a sound basis
for the statistical methods it teaches. Beyond that, mathematical
proofs and further examples of the techniques introduced are
almost always referenced. The references are comprehensive and
relevant as are the examples and example-problems in the text. 

The tables have been gathered into a single appendix in this
edition and are, with the exception of the table of the cumulative
normal, most useful. One must conclude that the sixth edition of
Snedecor and Cochran’s text Statistical Methods will continue to
be a pre-eminent book in the field.


James A. WALSH
Iowa State University


Shelly C. Stone and Bruce Shertzer (Eds.) “Testing.” Guidance
Monograph Series III. New York: Houghton Mifflin, 1968.
$14.95 (nine paperback volumes).

In reviewing the present series, each monograph will be
summarized briefly, followed by a general discussion and evaluation
of the nine volumes as a whole.


Kathryn W. Linden and James D. Linden. Modern Mental
Measurement: A Historical Perspective, $1.95.


This beginning work presents a chronologically ordered sum- 
mary of the history of the mental measurement movement. Chapter 
one points out Spanish, German, and British historical contributions. 
Chapter two outlines the work of Fisher, Rice, Thorndike, Spear- 
man, Terman, and Otis. Also included is an interesting treatment 
of the people involved and procedures used in the development 
of the first group ability tests used in the U. S. Army. Chapter 
three discusses, among other things, the feud between the uni- 
factor and multifactor theorists. 


Womer, Frank B. Basic Concepts in Testing, $1.75. 


This volume includes elementary statistical concepts through 
correlation, separate chapters on reliability and validity, and a 
discussion of basic test construction. Sections on statistics and test 
construction are routine. However, major strengths are the re- 
liability and validity chapters. Womer has done an excellent job 
introducing the concept of error in reliability and pointing out how 
different procedures for estimating reliability are sensitive to dif- 
ferent kinds of errors. 


Norville M. Downie. Types of Test Scores, $1.35. 


Included in this monograph are discussions of various descriptive 
statistics (mean, standard deviation, linear regression) and sampling 
statistics (standard error of the mean, estimate). One chapter 
focuses on derived scores (age norms, etc.), one on centile points 
and ranks, and another on standard scores (Z, etc.). The last
section deals with norms.


Robert H. Bauernfeind. School Testing Programs, $1.35.


The present volume begins by emphasizing the importance of 
well defined, behaviorally based educational objectives. It provides 
resource information on tests and measurements, and suggests the 
kinds of test information teachers consider the most desirable 
to know. The concept of a battery of tests measuring general
educational development is developed along with names of
suggested standardized tests. One chapter discusses ten fallacies in
school testing programs, while the last section describes how a
testing program should be organized. 


Howard B. Lyman. Intelligence, Aptitude, and Achievement Test- 
ing, $1.35. 


Topics such as correction for guessing, different kinds of
maximum performance tests, and the various types of objective and
non-objective items are discussed in this volume. One chapter also
presents a listing of different test publishers and scoring services. 
The last two chapters discuss terminology of testing, and the mean- 
ing and use of test scores. 


Howard W. Stoker. Automated Data Processing in Testing, $1.55. 


A general introduction to the field of electronic data processing 
is given showing its present and anticipated future importance to 
decision making processes in guidance. Chapter one presents several 
types of applications of coded information. Chapter two discusses 
unit record equipment and computers, and the basic steps involved 
in electronic data processing. Chapter three examines various
aspects of processing test results including a discussion of different
scoring machines. The last chapter describes the role of electronic 
data processing in educational research and briefly elaborates on 
the concept of data banks. The appendix includes a useful glossary 
of common data processing terms. 


William C. Cottle. Interest and Personality Inventories, $1.95.


In the first chapter, Cottle presents general discussion on con- 
struction, administration, scoring, and interpretation of interest 
and personality inventories. Chapter two indicates the need for 
counsellors to develop professional frames of reference in using
and interpreting psychological instruments. The remaining
sections of the monograph discuss commonly used interest
inventories (SVIB, KPPS, etc.) and selected personality questionnaires
(MMPI, CPI, etc.).


James D. Linden and Kathryn W. Linden. Tests on Trial, $1.95. 


The major purpose of this monograph is to review the most 
commonly used standardized mental ability, achievement, and
interest tests. A total of 24 instruments are reviewed: ten ability,
ten achievement, and four interest inventories. Each section is
preceded by a short discussion of the history and general
characteristics of the particular type of test. Each test review is
presented in a standard format giving information such as levels,
purpose, description, reliability and validity, and strengths and
limitations.


James R. Barclay. Controversial Issues in Testing, $1.75. 


This monograph focuses on several widespread criticisms of
psychological instruments. Chapter one lists and discusses these
criticisms. Chapter two defines culture and examines the
interaction between culture and testing practices. Chapter three discusses
studies related to testing and the environmental press. The last
chapter investigates the criterion problem as it relates to using 
and interpreting tests. 


Evaluation of Series 


In reviewing the series as a whole, four general questions will 
be discussed. How does the content compare to content found in 
standard texts already in the field? Is there a problem of volume to 
volume overlap? How does the cost compare to standard texts for 
the material covered? What is the quality of the volumes? 


Content 


In general, the content found in each volume differs little from 
that found in texts such as Thorndike and Hagen (1961) and 
Stanley (1964). Several exceptions can be noted, however. The
monograph Modern Mental Measurement: A Historical
Perspective presents a much more detailed history of mental
measurement than is found in other introductory books. In my
estimation, this volume is the best contribution of the series,
although it will not be the most useful. Tests on Trial, the
monograph which reviews 24 commonly used tests, will probably be the
best seller. It is not that the information contained is unique but
rather the accessible form in which it is presented. Controversial
Issues in Testing presents a good discussion of the interaction of
culture and practices of testing, a topic rarely discussed in other
introductory sources.

Although the series is slanted toward counsellors and their use 
of tests, it is interesting to note that, unlike most introductory 
texts, there is no discussion of 1) individual intelligence tests 
except as they relate to history and 2) ways in which tests in
general and test items in particular can be improved.


Overlap 


A major problem of this work is the degree to which numerous
topics overlap in two or more volumes. Half the material of
Chapter two in Types of Test Scores (descriptive statistics) has
already been presented (by different authors though) in Basic
Concepts in Testing. A list of test publishers in introductory texts
is located in three different volumes. Different kinds of test scoring
services and machines are discussed in both Intelligence, Aptitude,
and Achievement Testing and Automated Data Processing in
Testing. Instruments such as the SVIB, CPI, and KPPS are doubly
reviewed (admittedly in somewhat different ways) in Tests on
Trial and Interest and Personality Inventories. These four
examples are the major points of overlap; there are others of lesser
importance. However, it seems as though the series could have
been edited more carefully, re-organized and condensed to avoid
overlap. For example, Basic Concepts in Testing and Types of
Test Scores should have been condensed and put into one—not 
two separate volumes. To have descriptive statistics, reliability and 
validity in one volume and descriptive statistics again, derived and 
standard scores, and norms in the other volume seems to split up 
natural “go-together” material. 


Cost 


Most introductory texts in measurement average between 400
and 600 pages at a cost of $8.00 to $10.00. Taking into consideration
the cost, length, and amount of overlap (I judge there to be at 
least fifty pages)—the present series becomes more costly for less 
material. It might be argued that the number of volumes allows 
greater flexibility. Guidance personnel can choose only those texts 
that are necessary thus cutting down total cost. However, because 
each volume covers so little material, the probability of only se- 
lecting a few becomes remote. For example, in order to get basic 
statistical concepts one needs two volumes; for a discussion of 
cognitive and non-cognitive tests one needs three volumes. It is 
doubtful that anyone seriously contemplating using the series 
would select fewer than 4 or 5 volumes—which brings the cost 
close to that of one much more complete introductory text. My 
general feeling is that the best economy the editors could have in- 
troduced would have been to carefully edit out the overlap and 
condense the material into one or at most two volumes. 
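
The cost argument can be checked against the prices listed in the volume headings above. The sketch below is only an arithmetic illustration: prices are held in cents, the helper name `cost_range` is introduced here for convenience, and which volumes a reader would actually choose is an assumption, so only the cheapest and dearest possible selections are shown.

```python
# Prices in cents, copied from the volume headings in this review.
PRICES = {
    "Modern Mental Measurement": 195,
    "Basic Concepts in Testing": 175,
    "Types of Test Scores": 135,
    "School Testing Programs": 135,
    "Intelligence, Aptitude, and Achievement Testing": 135,
    "Automated Data Processing in Testing": 155,
    "Interest and Personality Inventories": 195,
    "Tests on Trial": 195,
    "Controversial Issues in Testing": 175,
}


def cost_range(prices, k):
    """Return (cheapest, dearest) total in cents for any choice of k volumes."""
    ordered = sorted(prices.values())
    return sum(ordered[:k]), sum(ordered[-k:])


def dollars(cents):
    """Format an integer number of cents as a dollar amount."""
    return "$%d.%02d" % divmod(cents, 100)


print("whole series:", dollars(sum(PRICES.values())))
for k in (4, 5):
    low, high = cost_range(PRICES, k)
    print(k, "volumes:", dollars(low), "to", dollars(high))

# The two volumes needed for basic statistical concepts.
stats_pair = PRICES["Basic Concepts in Testing"] + PRICES["Types of Test Scores"]
print("statistics pair:", dollars(stats_pair))
```

Under these prices a selection of five volumes falls between $7.35 and $9.35, which overlaps the $8.00 to $10.00 quoted for a single complete text.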


Quality 

In general, I think the overall series is well written. However,
one gets the impression that in most volumes only a superficial
treatment of the topic is given. In Intelligence, Aptitude, and
Achievement Testing, for example, there is no discussion of the
steps involved and problems associated with standardizing a test,
no specific examples (if there are some!) of differences between
aptitude and achievement items, or examples of items found on
commonly used aptitude and achievement tests. In other cases,
authors seem to discuss points at length within their topics
without explaining the relevance of the discussion. This stands out in
part of Chapter three of Interest and Personality Inventories.


In addition to the above, the only major quarrel I have with the
authors’ opinions concerns some of the statements made by Cottle in
Interest and Personality Inventories. In one sentence he states
that "It should be noted at first that the client knows nothing
about how to interpret test scores or profiles" (p. 15). This seems
rather harsh; perhaps “in general, the client rarely understands
how to interpret . . .” would be better and closer to the truth.
Also, one gets the impression from this volume that the only
problem with interest and personality tests lies with the user—no 
comments are made suggesting that perhaps some of the instru- 
ments have inherent weaknesses. Therefore it seems as though 
Cottle assumes the validity of the instruments just as long as 
people use them correctly.

In overall summary, it seems that the series adds little to what 
is found in a variety of introductory texts. The overlap problem 
is quite apparent and the cost is high. It seems as though a much
better editing job could have been accomplished. However, even 
with careful editing one can seriously question the overall use- 
fulness of such a series since it deals with a topic normally taught 
or learned as a unit. I can see little or no advantage to proliferating 
the unit into nine volumes when one, more tightly knit, already 
available volume will suffice. In spite of this, the volumes Mod- 
ern Mental Measurement: A Historical Perspective and Tests on 
Trial are excellently done and should be received well. 


Dennis ROBERTS
Department of Measurement & Evaluation 
The Ontario Institute for Studies in 
Education 

Toronto 5, Canada 


Edward A. Suchman. Evaluative Research: Principles and Practice
in Public Service and Social Action Programs. New York: Russell
Sage Foundation, 1967. Pp. ix + 186. $6.50.


Suchman has produced a well written and timely book that
deserves a place on the shelves of administrators who are planning
and responsible for projects designed to change human behavior.
This book will also be widely used in classes required for those
who will be involved in the evaluation of funded projects. The
increasing requirement of planned evaluation as an essential
condition for funding projects has already made this function an
important aspect of the work of a trained researcher.

Evaluative Research is organized into ten chapters which could
be grouped into sections on (1) the need for evaluative research,
including the difference between evaluation and research,
(2) the nature of evaluative research, and (3) administrative
problems related to evaluative research. Suchman notes
the value loading inherent in any evaluation and points up
increasing competition for support of projects. The competition
will increase the need for sound evaluations in order to gain
continuing support for projects.

While Suchman draws most of his illustrations from public
health studies there is no difficulty in translating the principles he
proposes to evaluation of funded educational projects. He puts 
stress on competition between traditional ways of doing things 
and new approaches, bringing support decisions which require 
the critical evaluation of ongoing and well established traditional
programs. The need for critical evaluation of traditional
programs is particularly germane to the field of education where
many of the projects have been funded in order to point to new 
ways of gaining old objectives as well as new ways of gaining new 
objectives. 

Suchman stresses the need for theory to undergird both the 
project undertaken and the research evaluation. He points out 
"many program objectives are based upon largely untested or 
even unsound assumptions whose validity rests primarily upon 
tradition or common sense and not on proven effectiveness." He 
stresses throughout the book the reality of the difficulties posed 
by the fact that evaluative research deals with projects to which 
strong emotional loadings have been attached while getting the 
project organized and funded and that the evaluation has strong 
threat value to project administrators. 

Project and research administrators will both find Suchman’s 
descriptions of time order helpful. Project goals may be long
term, such as raising the working life productivity of
people from poverty; they may be intermediate term, as when the
goal is stated as increasing the probability of young people from
poverty completing high school; or they may be short term, such
as a goal that nursery school attendance will increase the number
of words in the active vocabulary of pre-school children from
poverty. Suchman makes the point that in nearly all cases the
immediate short-term goals implicitly assume that reaching these
goals provides a step toward the more distant goals even though
these objectives may not be explicitly stated. Suchman
emphasizes the desirability of making explicit the long-term goals and
of being equally explicit about the ways in which the intermediate
and the short range goals will contribute to reaching the more
distant objectives. Suchman also points out that each of the stages
should be evaluated for effectiveness since there is interaction
between the goals. In some cases the intermediate goals are not a
logical progression from reaching the short-term goals. On the
other hand the logic may be well founded but the
programming may be poor so that the intermediate goals are not
reached because the short-term objectives have not been reached.
Carefully executed evaluation studies will reveal where the
breakdown occurs and, if properly structured, why the breakdown
came about. Suchman has some well documented illustrations of
the way in which various levels of objectives can be structured
and the way in which each of them may be evaluated
effectively.

The heart of Suchman's book is built around the statement, 
“. . . we would like to propose five categories of criteria according
to which the success or failure of a program may be evaluated.
These are: (1) Effort, (2) Performance, (3) Adequacy of Per- 
formance, (4) Efficiency, and (5) Process.” Suchman points out 
that the first two necessarily precede the last three categories. 

In evaluating effort in an educational setting for pre-school 
children some of the factors that might be tabulated would prob-
ably include: (1) the adult-child ratio, (2) the teacher-teacher
aide ratio, (3) the number of home visits made, (4) the amount
and variety of materials and equipment available, (5) the salary 
levels paid in competition to other open job opportunities, (6) 
the length of time the children were in school, (7) the regularity 
of attendance, and (8) the length of the coffee breaks taken by 
the teachers. There are many other related indices that could be 
tabulated. The principle involved is that, in general, the more
active the personnel are the more likely it will be that something
desirable will be happening. Obviously this kind of tabulating is
comparatively easy to handle but leaves something to be desired 
as far as certainty that the activity is leading to desirable results. 

The crucial evaluation is that of performance in which the ef- 
fectiveness of the effort is measured. This performance can be 
measured in terms of the immediate, the intermediate, or the long-
term goals. In the case of the pre-school experience the long-
term goal would be more effective adult performance. Obviously,
while this is the eventual criterion, decisions must be made before
these goals can be assessed effectively so more immediate per-
formance goals must be set up and evaluated.

The adequacy of the performance deals with the percentage of
the need that has been met. In terms of pre-school education this
evaluation would take into consideration the percentage of the
children from poverty who are being served by a program in
comparison to the total need. A value judgment comes into play
here, as in all evaluative research, which concerns the desirability
of intensive work with a small number compared to less intensive
and presumably (to be evaluated) less effective work with
a larger group.

Efficiency is closely related to adequacy of performance and
focuses on the question of whether similar results could have
been gained from similar effort applied differently. In an econ-
omy where costs were not limited effectiveness would be the
overriding criterion. In trips to the moon improving the effec-
tiveness of a component from 95 per cent reliability to 100 per
cent would justify increasing the cost by 20 times. In education
similar increases in cost for similar increases in effectiveness have
not yet been accepted. 

The fifth criterion Suchman stresses is that of process. This 
evaluation deals with the question of why the program was either 
successful or unsuccessful. While evaluative research may, and
often does, stop short of analysis of the reasons why the program
succeeds or fails, this analysis is of critical importance in modifying
the program intelligently.

Suchman continually stresses the reality situation which takes 
into account the administrative difficulties of setting up control 
groups that deny service to needy people in order adequately to 
judge the effectiveness of the services rendered. He stresses the 
multiple kinds of causation and the multiple effects that emerge 
in a real social context. His design suggestions are helpful. His
chapters on administration, in which he separates evaluative re-
search as a tool of administration and the administration of evalu-
ative research operations, are clear and clean.

Evaluative Research is a useful and helpful book. Probably the 
best evaluation I can make is to say that I expect to use ideas 
presented in some evaluative research projects I am engaged in
and that I expect to use the book with classes studying research 
design and operation. 

JOHN A. R. WILSON
University of California 
Santa Barbara 
CORRELATIONS BETWEEN SCORES ON
PERSONALITY SCALES WHEN ITEMS ARE
STATED IN THE FIRST AND
THIRD PERSON FORM


ALLEN L. EDWARDS 
University of Washington 


THE Edwards (1967) Personality Inventory (EPI) consists of
five test booklets, each containing 300 statements. Two of the test 
booklets, IA and IB, are comparable forms for 14 of the 53 scales 
that can be scored in the complete EPI. In taking the EPI, the 
examinee is asked to judge whether those individuals who know 
him best would answer the items True or False if they were asked 
to describe him. Defining the task for the examinee in this way also 
makes it possible to inform him that there is a “correct” or “right” 
answer to each item, the “correct” answer being the one that most 
of the people who know him best would give if they were, in fact, 
to describe him. In accordance with the task set for the examinee, 
all of the items in the EPI are stated in the third person form. 

If the items in the EPI were to be changed to the first person 
form, would scores on the EPI scales be changed drastically? No 
evidence on this point has been published. The present study reports 
on the correlations between the 14 scales in Booklet IB of the EPI
and these same scales when the items are stated in first person
form.


Method

In a large scale testing project, 171 females and 115 males were
administered Booklets IA, II, III, IV, and IB of the EPI, in that
order, at one testing session. They then completed the Allport,
Vernon, and Lindzey Study of Values. At a second testing session,
two days later, the same subjects were administered the California
Psychological Inventory and then Booklet IC. Booklet IC contained 
the same items in the same order as the items in Booklet IB of the 
EPI, except that all items were stated in the first person form 
instead of the third person form. After completing Booklet IC, 
the subjects were administered a number of other personality 
inventories. 


Results and Discussion 


Table 1 gives the correlations between the scales in Booklets 
IA and IB with those in Booklet IC. The correlations between the 
scales in the two comparable forms, IA and IB, are also given in 
the table. It may be noted that the values of r_AC are quite com-
parable to the values of r_AB. Correlations of the scales with the
Social Desirability (SD) scale of the Minnesota Multiphasic 
Personality Inventory, the Kuder-Richardson Formula 20 values of 
the scales, and the means and standard deviations of the scales 
also tend to be much the same for corresponding scales in all three 
test booklets. 

Because Booklet IA of the EPI is a comparable form of Booklet 
IB, it seems reasonable to believe that similar results would have 
been obtained if the items in Booklet IA had been put in first person 
form rather than those in Booklet IB. It is possible, however, that 
translating the items in Booklets III, IV, and V of the EPI into 
first person form might result in scores that differ considerably from 
those obtained with the standard EPI booklets. 

A limitation of this study is that the same subjects responded 
to the items in each of the three test booklets. Thus, there is no 
way of knowing to what degree the scores obtained with Booklet 
IC were influenced by memory and other carry-over effects re-
sulting from the prior administration of Booklets IA and IB. For
example, if subjects had been randomly assigned to Booklet IC,
so that they answered only the items in this booklet and not those
in IA and IB, the means, standard deviations, and the correlations 
of the scales with the SD scale might differ considerably from the 
corresponding values for Booklets IA and IB. 
EXPECTED RELIABILITY AS A FUNCTION
OF CHOICES PER ITEM


ROBERT L. EBEL
Michigan State University


By making several reasonable assumptions, it is possible to esti-
mate the reliability that should be expected in an objective test as
a function of the number of choices per item. These estimates for a
100 item objective test are presented in Table 1. They indicate an
appreciable increase in reliability when the number of choices per
item is increased from two to three; a smaller increase in going to
four choice items; and still smaller increases beyond that point. They
indicate, too, that a 100 item true-false test yielding a reliability
coefficient of .74 is a reasonably good test, as true-false tests go,
whereas a reasonably good 100 item test made up of four-choice
items ought to yield a reliability coefficient of .86.

The same assumptions can be used to answer a different question:
how many two-choice, three-choice, etc. items must be included in
a test to justify expectation of a reliability of .90? These values
are presented in Table 2. Again, the successive differences in
numbers of items, substantial at first, diminish progressively. But
in all cases, the tests need to be long to justify the expectation of
fairly high reliability.


TABLE 1

Expected Reliability of a 100 Item Objective Test as a
Function of Choices Per Item

Choices per Item      Expected Reliability
(table entries not recoverable)


TABLE 2

Number of Items Needed to Justify Expectation of a Reliability of .90 for
Various Numbers of Choices Per Item

Choices per Item      Number of Items
(table entries not recoverable)

The values shown in Table 1 were obtained from the formula: 


$$r = \frac{k}{k-1}\left[1 - \frac{9(N+1)}{k(N-1)}\right] \qquad (1)$$


The values shown in Table 2 were obtained from the formula: 


$$k = \frac{9(N+1)}{(1-r)(N-1)} \qquad (2)$$


which is simply an algebraic transformation of part of Formula 1,
omitting the fraction k/(k − 1). In both formulas, k represents the
number of items in the test, N the number of choices per item, and
r, of course, the reliability coefficient.
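
The two formulas can be checked numerically. The sketch below assumes that Formula 1 reads r = [k/(k − 1)][1 − 9(N + 1)/(k(N − 1))], the form that follows from the assumptions listed next and that reproduces the .74 and .86 quoted in the text, and that Formula 2 is the same expression solved for k with the factor k/(k − 1) dropped. The function names are illustrative and are not the author's.

```python
def expected_reliability(k, n_choices):
    """Formula 1 (as reconstructed): expected reliability of a k-item test
    whose items each offer n_choices answer choices."""
    return (k / (k - 1)) * (1 - 9 * (n_choices + 1) / (k * (n_choices - 1)))


def items_needed(target_r, n_choices):
    """Formula 2 (as reconstructed): number of items needed to expect a
    reliability of target_r, ignoring the factor k/(k - 1)."""
    return 9 * (n_choices + 1) / ((1 - target_r) * (n_choices - 1))


if __name__ == "__main__":
    # Reliability of a 100 item test and items needed for r = .90,
    # for two to five choices per item.
    for n_choices in range(2, 6):
        r = expected_reliability(100, n_choices)
        k = items_needed(0.90, n_choices)
        print(n_choices, round(r, 2), round(k))
```

Under these assumptions the two-choice and four-choice rows agree with the reliabilities of .74 and .86 cited above; the remaining rows are computed values, not entries recovered from the printed tables.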

Formula 1 was derived on these five assumptions: 


1. A reasonable estimate of the mean score of a good objective
test is a value midway on the score scale between the maximum
possible score and the expected chance score.

2. A reasonable estimate of the standard deviation of the scores
on a good objective test is one-sixth of the difference be-
tween the maximum possible score and the expected chance
score.

3. A reasonable estimate of the reliability coefficient of an ob- 
jective test is provided by Kuder-Richardson Formula 21. 

4. The maximum possible score on an objective test is k, the 
number of items. 

5. The expected chance score is the number of items, k, divided
by the number of choices per item, N.


Expressed as formulas, the first three assumptions become:

$$M = \frac{k(N+1)}{2N} \qquad (3)$$

$$\sigma = \frac{k(N-1)}{6N} \qquad (4)$$

$$r = \frac{k}{k-1}\left[1 - \frac{M(k-M)}{k\sigma^{2}}\right] \qquad (5)$$


Substitution of the expressions for M and σ given by Formulas
3 and 4 in Formula 5 leads, by simple algebra, to Formula 1.
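
The algebra can be written out step by step. This is a sketch of the intermediate steps, which the original omits; it assumes that Formulas 3 and 4 read M = k(N + 1)/2N and σ = k(N − 1)/6N and that Formula 5 is Kuder-Richardson Formula 21.

```latex
\begin{align*}
k - M &= k - \frac{k(N+1)}{2N} = \frac{k(N-1)}{2N}, \\
M(k-M) &= \frac{k^{2}(N+1)(N-1)}{4N^{2}}, \\
k\sigma^{2} &= \frac{k^{3}(N-1)^{2}}{36N^{2}}, \\
\frac{M(k-M)}{k\sigma^{2}} &= \frac{36\,k^{2}(N+1)(N-1)}{4\,k^{3}(N-1)^{2}}
  = \frac{9(N+1)}{k(N-1)}, \\
r &= \frac{k}{k-1}\left[1 - \frac{9(N+1)}{k(N-1)}\right].
\end{align*}
```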

The values shown in Tables 1 and 2 were obtained from
Formula 1 by holding constant (at 100) the value of k, the
number of items, and by allowing N, the number of choices per
item, to vary. One could, of course, hold N constant and allow k
to vary. This would presumably indicate how reliability varies
with test length. Values obtained for a two-choice (true-false)
test are presented in Table 3.

These values show the general pattern of negatively accelerated 
(diminishing) increases that is commonly associated with this 
relationship. How closely do they agree with those given by the 
Spearman-Brown formula? Taking the value of a 100 item test as 
the starting point, application of Spearman-Brown gives values 
shown in the third column of the table. Agreement is fairly good, 
though not perfect for the tests longer than 100 items. For the 50 
item test, the discrepancy is sizeable. 
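
The Spearman-Brown column of Table 3 can be reproduced with the standard prophecy formula, taking the 100 item reliability of .74 as the starting point. The function name is illustrative only.

```python
def spearman_brown(r, length_ratio):
    """Spearman-Brown prophecy formula: reliability of a test whose length
    is length_ratio times that of a test with reliability r."""
    return length_ratio * r / (1 + (length_ratio - 1) * r)


if __name__ == "__main__":
    r_100 = 0.74  # expected reliability of the 100 item true-false test
    for k in (50, 150, 200):
        print(k, round(spearman_brown(r_100, k / 100), 2))
```

The printed predictions for 50, 150, and 200 items are .59, .81, and .85, the values listed in the third column of Table 3.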

One major reason for the discrepancy is that our assumption con- 
cerning a reasonable estimate of the standard deviation, while ac- 
ceptable for a test of 100 items, is not equally acceptable for 
much shorter tests. For the ratio of standard deviation to available
score range (i.e., maximum score minus expected chance score)
tends to be greater for short tests than for long tests.


TABLE 3

Expected Reliability as a Function of Number of Items in a True-False Test

Number of     Expected       Spearman-Brown
Items         Reliability    Predictions

 50           .48            .59
100           .74            --
150           .83            .81
200           .86            .85



The variance of scores on a test of k items is the sum of k item
variances plus the sum of k(k − 1) inter-item covariances. Since
the inter-item correlations in typical tests tend to be small (on
the order of .05) the item variances tend to be much larger than
the inter-item covariances. On a ten item test, these larger variances
make up 10 per cent of the elements contributing to the score
variance. On a one hundred item test, however, they make up
only 1 per cent of the elements. Thus, the standard deviation of
a short test tends to be a larger fraction of the number of items
than is true of a long test. An illustration of this relationship,
in specific terms, is presented in Table 4. Note that the standard
deviation is larger relative to the useful score range (k/2 for two-
choice items) in the shorter tests than it is in the longer tests.
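
The relationship can be illustrated numerically. The sketch below assumes the values stated with Table 4 (an item variance of .18 and an inter-item covariance of .0054) and computes the score variance, the standard deviation, and the ratio of the useful score range k/2 to the standard deviation. The function and constant names are illustrative.

```python
import math

ITEM_VARIANCE = 0.18       # assumed variance of each item (Table 4)
ITEM_COVARIANCE = 0.0054   # assumed covariance of each pair of items (Table 4)


def score_variance(k, item_var=ITEM_VARIANCE, item_cov=ITEM_COVARIANCE):
    """Score variance: k item variances plus k(k - 1) inter-item covariances."""
    return k * item_var + k * (k - 1) * item_cov


if __name__ == "__main__":
    for k in (10, 25, 50, 100, 200):
        variance = score_variance(k)
        sd = math.sqrt(variance)
        # Useful score range for two-choice items is k/2.
        print(k, round(variance, 3), round(sd, 2), round((k / 2) / sd, 2))
```

The ratio in the last printed column grows with test length, which is the point of the paragraph above: the standard deviation is a larger share of the useful range in short tests.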

Thus, if one is working with very short tests, more reasonable 
estimates of reliability could be obtained by assuming, not that the 
standard deviation is one sixth of the difference between the 
maximum possible score and the expected chance score, but by 
assuming it to be one third for tests of ten items or less, one
fourth for tests from 11 to 20 items, and one fifth for tests from
21 to 60 items. These assumptions would, of course, lead to
corresponding changes in Formulas 1 and 2. However, most of the
tests we work with are long enough to make these adjustments
unnecessary.

TABLE 4

Hypothetical Ratio of Useful Score Range to Standard Deviation for
True-False Tests of Various Lengths

Assumptions: $\sigma_i^2 = .18$, $r\sigma_i^2 = .0054$

Score variance: $\sigma_t^2 = k\sigma_i^2 + k(k-1)\,r\sigma_i^2$

Test Length    Score Variance    Standard Deviation    Ratio of Useful
k              σ_t²              σ_t                   Range to σ_t

 10              2.286             1.51                [illegible]
 25              7.740             2.78                [illegible]
 50             22.23              4.72                [illegible]
100             71.46              8.45                [illegible]
200            250.92             15.8                 [illegible]


Returning to the main argument, let us point out that the
expected reliability is not a maximum reliability. A test of given
length, with a given number of choices per item, could show much
higher reliability than would be predicted from Formula 1 if:


1. The test items are unusually high in quality. 
2. The test is unusually homogeneous in content. 
3. The group tested is unusually variable in ability. 


Alternatively, and for complementary reasons, a test might show
much lower reliability than predicted by Formula 1. The main 
thing that the formula does is to show, on the basis of uniform 
assumptions for all tests, what reliability may be normally expected 
from tests composed of various types of items. If you beat the
prediction, either you have built an unusually good test or the
situation favored high reliability. If not, the odds may have been
against you or, just possibly, you may have done a poor job of test
construction.

One inference from Table 2 is that if a teacher can write, and a
student can answer, two true-false test items in less time than is
required to write or to answer one four-alternative multiple-choice 
test item, preference should be given to the true-false item type. 
Not everyone will agree that a true-false test item is essentially 
the same as a two-alternative multiple choice item, or that an 
achievement can be tested equally well using either type. I believe 
that the differences between the two are of no great significance, 
but this is not the occasion to make a case for the versatility and 
value of true-false test items. Misconceptions are too common, 
and biases too deep-seated to make that an easy task. But it does
need to be undertaken some time.

A number of years ago Remmers (1941) and his students 
published a series of empirical studies of the relation between 
number of choices per item and test reliability. They found that, 
given the reliability of a test composed of items offering, for 
example, four answer choices, they could predict reasonably well on 
the basis of the Spearman-Brown formula the reliability of similar 
tests using items with other numbers of answer choices. While the 
effectiveness of the formula in that application is not wholly 
accidental, it has no strong rational justification. And, of course, the 
Spearman-Brown formula requires, as ours does not, an empirically 
determined reliability coefficient to use as a starting point. 
NOTE ON A CRITERION FOR THE NUMBER
OF COMMON FACTORS¹


LLOYD G. HUMPHREYS AND DANIEL R. ILGEN
University of Illinois 


LINN (1968) has published the results of a Monte Carlo
approach to the number of factors problem involving the use of
random normal deviates as additional variables in correlational
matrices from which principal components are extracted. Horn
(1966) has made use of an independent component analysis of
random normal deviates, parallel to the analysis of the “real”
variables, in order to correct the Kaiser criterion (see Horn),
which assumes population values of the correlations, for capitaliza-
tion on chance in the sample. This note briefly presents some
results obtained from another, and more promising, variation on 
their procedures. 

Linn found that his mean square ratios, analogous to F-ratios 
with data from the latent roots of real variables in the numerator 
and of random variables in the denominator, fluctuated widely
as ratios were formed from successive roots. The lack of indepen- 
dence arising from the fact that both the numerator and denomi- 
nator were derived from the same analysis, though from independent 
sets of variables, is the probable cause of this variability. Horn’s 
use of a parallel analysis thus is indicated. 

Horn, however, was interested in the Kaiser rule which involves
the use of unities in the diagonal of the correlational matrix. The
automatic use of unity for a real variable whose population com-
munality in any particular set of measures may be zero, and for
random variables whose population communalities are zero, follows
the incorrect model. On a priori grounds the decision as to the
number of common factors should be based upon the use of an
estimated communality for each variable. Factor analysis should
not be confounded with component analysis.

¹ This research was supported by the Office of Naval Research under Con-
tract N00014-67-A-0305-0012.

One objective, useful way of estimating communalities is to
compute the multiple correlations between each variable and all of 
the others. The squares of these values are lower-bound estimates 
in the population; they are also relatively stable from sample to 
sample since they depend upon all of the data. True, they may 
become gross overestimates of communalities in a sample, but this 
property can be kept in bounds by appropriate selection of number 
of observations (N) and number of variables (n). Combinations 
of N and n that create problems in communality estimation are 
also undesirable from other points of view (Humphreys, Ilgen,
McGrath, and Montanelli, 1968).
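
For readers who want to see the estimate concretely: the squared multiple correlation of variable i with all the others equals 1 − 1/r^ii, where r^ii is the i-th diagonal element of the inverse of the correlation matrix. This identity is a standard result rather than something stated in the note, and the small matrix below is made up for illustration. A minimal sketch in plain Python:

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def squared_multiple_correlations(corr):
    """Squared multiple correlation of each variable with all the others."""
    inverse = invert(corr)
    return [1.0 - 1.0 / inverse[i][i] for i in range(len(corr))]


if __name__ == "__main__":
    # Hypothetical three-variable correlation matrix (not from the note).
    corr = [[1.0, 0.6, 0.5],
            [0.6, 1.0, 0.4],
            [0.5, 0.4, 1.0]]
    print([round(v, 3) for v in squared_multiple_correlations(corr)])
```

These values would replace the unities in the diagonal of the correlation matrix before the latent roots are extracted.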

The use of multiple correlations squared as initial communality 
estimates for both real and random variables has other attractive 
properties. High reliability of the real variables will permit high 
correlations and high communalities with large latent roots. A 
large N will insure small intercorrelations and communality 
estimates for the random variables and, accordingly, small latent 
roots. In a study with well constructed and selected real variables 
to produce high communality and with the intercorrelations based 
upon a large N for sampling stability, the curves of the latent 
roots in the parallel analyses should separate in nearly optimum
fashion for the factors that are clearly nonrandom. 


Procedure 


Professor Ledyard Tucker selected four sets of variables from
four published studies for another purpose and factored these in
several ways. Among other analyses, he included maximum likeli-
hood solutions. When these data were made available to the
writers, it was decided to set up parallel analyses with multiple 
correlations squared in the diagonals of the correlational matrices 
and compare the results with those obtained from the maximum 
likelihood method. 

The variables included in these analyses are listed in Table 1
along with the sources from which they were drawn. Two in-
dependent analyses were completed within each set: one with the
full number of variables listed and the second with the first seven
variables. The omitted variables in each case were selected so that
the number of common factors would be reduced by one. A criterion
for the number of significant factors should reflect this element of
the design.


TABLE 1

Variables, Sample Sizes, and Data Sources Used

 1. Inventive Opposites     Addition            Addition            Addition
 2. Completion              Multiplication      Multiplication      Multiplication
 3. Word Grouping           Arithmetic          Three-Higher        Three-Higher
 4. Concrete Association    Figures             Figures             Figures
 5. Identical Names         Cards               Cards               Cards
 6. Identical Numbers       Squares             Flags               Flags
 7. Highest Number          Identical Numbers   Identical Numbers   Identical Numbers
 8. Addition                Identical Forms     Faces               Faces
 9. Multiplication          Repeated Letters    Mirror Reading      Mirror Reading
10. Division
11. Arithmetic Reasoning

    N = 215                 N = 286             N = 710             N = 437
    Thurstone (1938)        Thurstone (1940)    Thurstone &         Thurstone &
                                                Thurstone (1941)    Thurstone (1941)

It was also decided to try one other method of communality
estimation, and to include a set of factorings using unities as
well, for comparative purposes. The second method of communality
estimation selected was the use of the highest correlation in a
column adjusted in accordance with the absolute level of the
correlations involving each of the variables.² Three sets of parallel
factorings involving three different diagonal values in the cor-
relational matrices will be compared, therefore, with maximum
likelihood factors.


Results 


The first four to five latent roots for the various analyses com-
pleted are presented in Table 2 along with the chi-square values
for the maximum likelihood solutions. Each block of data will be
discussed in turn.

² The following formula was used: [formula illegible in source]


TABLE 2

Data from Parallel Analyses and from Maximum Likelihood Solutions

(? marks an entry that is illegible in the source; the chi-square
values for the maximum likelihood solutions are not recoverable.)

                    Unities          Highest r-Adjusted    Squared Multiples
  N    n   Root   Real   Random      Real   Random         Real   Random

 215   11   1     4.12    ?          3.75    .59           3.64    ?
            2     2.36   1.29        2.02    .46           1.89    .38
            3     1.98   1.21         .85    ?              ?      ?
            4      .65   1.10         ?      .20            ?      ?
            5      .33   1.02         ?      ?              .02    .06

 215    7   1     2.04   1.30         ?      .38            ?      ?
            2     2.04   1.19         ?      .28           1.56    .18
            3      .53   1.18         ?      .13            .01    .11
            4      .49    .99         ?      .09           -.08    .02

 286    9   1      ?     1.33        3.00    .46           2.84    .32
            2     1.70   1.20        1.39    ?             1.22    .22
            3      ?     1.07         ?      ?              .61    .12
            4      .69   1.04         ?      ?              ?      .06
            5      .55    .97         ?      .09           -.04    .02

 286    7   1     2.89   1.24        2.54    .41            ?      ?
            2     1.64   1.08         ?      ?             1.14    ?
            3      .84   1.04         ?      ?              ?      .03
            4      .64    .99         ?      .07            ?     -.04

 710    9   1     3.85   1.20        2.89    .91           2.75    .24
            2     1.82   1.14        1.37    .20           1.24    ?
            3     1.00   1.06         .50    ?              .35    .06
            4      .58   1.02         ?      .12           -.05    .04
            5      .55    .98         .06    .05           -.06    .02

 710    7   1     2.07   1.15        2.23    .26           2.07    .18
            2     1.80   1.08        1.32    .21           1.19    .12
            3      ?     1.07         ?      ?              ?      ?
            4      .57    .99         .25    ?             -.07   -.04

 437    9   1     3.91   1.27        3.54    .26           3.38    .27
            2      ?      ?          1.44    ?             1.29    ?
            3      .86   1.13         .43    .16            .28    .07
            4      .56   1.05         ?      ?             -.01    ?
            5      .50   1.01         .09    .05           -.09   -.00

 437    7   1     3.13   1.26        2.79    .29           2.61    .16
            2      ?      ?          1.43    .24            ?      ?
            3      ?     1.05         ?      .10            ?      .07
            4      .46    .99         ?      .05            ?      ?

For the first block of data the parallel analyses involving the 
squared multiples give the most clear-cut answer to the number of 
factors question: the curves cross quite sharply between three and 
four factors. Both the use of unities and the maximum likelihood 
solution suggest three and reject four factors while the highest 
r-adjusted data accept three but do not unequivocally reject four 
factors. 

With the omission of four variables from the first analysis, 
both the unities and the squared multiple correlation methods in- 
dicate two factors. Maximum likelihood estimates are somewhat 
equivocal for two or three factors, and the highest r-adjusted 
curves do not cross even after three factors. 

For the third block the maximum likelihood chi square is below 
the .05 level by .01 for four factors, and the squared multiple 
analysis is equivocal for four. In this case the unities analysis 
gives an apparently more clear-cut answer, but ambiguity for this 
set of data may actually represent the truer picture. The clustering 
within the table of intercorrelations does not indicate a neat three 
factor solution. 

The ambiguity observed in the third data block is replicated in 
the results from the fourth. The fourth matrix was identical with 
the third except for the loss of two variables which were supposed 
to help define a factor, but did not do so very well. Both maximum 
likelihood and the squared multiples present evidence for three 
factors, and with only seven variables a fourth is indeterminate. 
The unities analysis suggests two factors, but the answer is more 
clear than the data. The highest r-adjusted data are not very 
helpful and will not be referred to again. 

In the fifth data block both the squared multiples and the 
maximum likelihood data show three factors with the latter in- 
dicating a fourth as possible. The curves based on unities cross 
after only two factors, yet three factors are clearly present in 
the intercorrelations. 

The several criteria lead to somewhat different conclusions in
the sixth block. The unities data indicate two factors, maximum
likelihood suggests at least three, while the squared multiples
support two strongly with a third ambiguous.

The consensus from block seven is for three factors. The unities
analysis indicates only two, however, while the squared multiples
and the maximum likelihood solutions do not clearly reject four.
Again, the intercorrelations indicate at least three.

In the last block two factors are indicated although the maxi- 
mum likelihood solution does not reject a third with high con- 
fidence. 

Table 3 presents a box score for the several methods compared, 
A factor is counted in the maximum likelihood solution if p < .05. 
In the absence of probability values for the parallel analyses 
approach, statements such as “does not reject” a particular number 
of factors have been introduced when the difference between the 
latent roots from real and random data were within a few 
hundredths of each other. 
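
The counting rule just described can be sketched in code. The function below counts the leading latent roots of the real data that exceed the corresponding roots of the random data, and flags the next root as "does not reject" when the two are within a small tolerance. The tolerance of .03 is an assumption standing in for "a few hundredths", and the function name is illustrative. The first example uses the unities roots for the 710-subject, nine-variable block as they appear in Table 2; the second uses made-up roots.

```python
def count_factors(real_roots, random_roots, tolerance=0.03):
    """Return (number of factors, borderline flag) from parallel latent roots.

    A factor is counted while the real root exceeds the random root by more
    than `tolerance`. The flag is True when the first uncounted root is within
    `tolerance` of its random counterpart, i.e. one more factor is not
    clearly rejected.
    """
    count = 0
    for real, rand in zip(real_roots, random_roots):
        if real - rand > tolerance:
            count += 1
        else:
            break
    borderline = (count < min(len(real_roots), len(random_roots))
                  and abs(real_roots[count] - random_roots[count]) <= tolerance)
    return count, borderline


if __name__ == "__main__":
    # Unities, N = 710, nine variables (roots as printed in Table 2).
    real = [3.85, 1.82, 1.00, 0.58, 0.55]
    rand = [1.20, 1.14, 1.06, 1.02, 0.98]
    print(count_factors(real, rand))

    # Hypothetical roots in which a third factor is not clearly rejected.
    print(count_factors([2.0, 1.0, 0.20], [0.5, 0.4, 0.18]))
```

For the unities example the curves cross after two factors, in agreement with the discussion of the fifth data block.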

If we were to establish the maximum likelihood method as the
criterion for the number of factors, parallel analysis involving
squared multiples more closely approximates the “correct” number
of factors than the method involving unities or the highest r-
adjusted values. If we reject maximum likelihood as the criterion
and look only for the method which has most in common with
the others, it appears that the parallel analysis procedure with
squared multiples is closer to the centroid of all than any other
including maximum likelihood.

Parallel analysis starting with unities in the diagonal does
generally produce two curves of latent roots that cross at a sharp
angle. For those interested in unambiguous answers, this is the
preferred technique. It quite clearly underestimates the number of
factors that should be retained in two of the present eight analyses,
however, and in larger matrices based upon larger Ns it would
greatly underestimate the number of stable common factors
(Humphreys, 1964).


TABLE 3

Box Score on Number of Factors Indicated by Each Method in Each Analysis

Data Block    Unities    Highest r-Adjusted    Squared Multiples    Maximum Likelihood
(table entries not recoverable; the data blocks are 215-11, 215-7,
286-9, 286-7, 710-9, 710-7, 437-9, and 437-7)

Some additional work has been done with parallel analyses with 
squared multiples, and the method continues to look promising. 
Sampling distributions of latent roots from random data are being
developed empirically, also. The present writers are willing to
recommend routine use of this technique in determining the number
of factors to rotate and interpret. It is less expensive in computer
time than a maximum likelihood solution. It can also be used
concurrently with inspection of the curve of the latent roots for
“breaks” as Linn recommends and as incorporated in the “Scree” 
test (Cattell, 1966). It should correct a basic difficulty of the in- 
spection of roots technique, i.e., the typical finding of several widely 
spaced breaks, by narrowing the choice down to a particular 
break. 

In recommending the use of this technique the writers do not 
necessarily recommend the rotation and interpretation of all of the 
supposedly nonrandom factors. It is still possible to discard replic- 
able factors on psychological or psychometric grounds. A very 
small contribution to variance in a particular analysis based upon 
a large N can be overlooked with good conscience. It is not as 
easy, on the other hand, to interpret factors in good conscience 
that cannot be distinguished from factors derived from the in- 
tercorrelations of random normal deviates. 


Summary

Using as points of departure procedures used by Linn and by
Horn in determining the number of common factors, we have
investigated the use of parallel analysis of real and random data
in which squared multiple correlations were used as initial estimates
of communalities. Eight small correlational matrices of real data
were factored in parallel with eight matrices of random data using
in succession unities, highest r-adjusted, and squared multiples
as initial communality estimates. Results were compared with
maximum likelihood solutions. The analyses involving squared
multiples provide answers to the number of common factors that
are generally closer to maximum likelihood determinations than 
any other. The squared multiple procedure also has most in com- 
mon with decisions made on the other bases, not excluding maxi- 
mum likelihood from consideration. The technique is worthy of 
consideration for routine use in making the number of factors de- 
cision. 
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RESTRICTIONS ON THE POSSIBLE VALUES 
OF $r_{12}$, GIVEN $r_{13}$ AND $r_{23}$


JULIAN C. STANLEY AND MARILYN D. WANG¹
The Johns Hopkins University 


In the final stages of preparing the manuscript of their new 
statistics textbook for the printer, Glass and Stanley (1969) dis- 
covered that they had offered without proof a formula for the 
restriction on the possible values of $r_{12}$, the Pearsonian coefficient
of correlation, in terms of $r_{13}$ and $r_{23}$. The investigators searched
a number of textbooks in psychology and education to determine 
whether proofs appear there. None were found. The basic proof seems 
due to Yule (1897). A proof is more readily available in Yule’s 
statistics textbook (e.g., Yule and Kendall, 1950, p. 301). 

The investigators offer herewith a simple algebraic version that
makes clear the nonmystical basis for these limits. Let us begin
with $r_{12.3}$, the first-order partial coefficient of correlation for the
correlation of variables 1 and 2 when variable 3 is “partialed
out.” The familiar formula (e.g., see Hays, 1963, pp. 574-576) is

$$r_{12.3} = \frac{r_{12} - r_{13} r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}}, \tag{1}$$

where neither $r_{13}$ nor $r_{23}$ = −1 or +1.

Yule and Kendall (1950, pp. 285-287) and others show that the
partial coefficient of correlation (in general, $r_{12.34 \ldots n}$) “may really
be regarded as, and possesses all the properties of, a [Pearson
product-moment] correlation coefficient . . .” (p. 285). Therefore, it can
vary in magnitude from −1 to 1, inclusive. Thus we may write

$$-1 \le r_{12.3} \le 1. \tag{2}$$

¹ We thank Gene V Glass and Robert A. Gordon for their comments
concerning an earlier version of this note.

Via (1), the inequality of (2) may be rewritten

$$-1 \le \frac{r_{12} - r_{13} r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}} \le 1. \tag{3}$$

Multiply all three parts of the inequality by $\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}$,
which is a non-negative quantity, and then add $r_{13} r_{23}$ to each part.
This produces the usual limits formula:

$$r_{13} r_{23} - \sqrt{(1 - r_{13}^2)(1 - r_{23}^2)} \le r_{12} \le r_{13} r_{23} + \sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}. \tag{4}$$

For visual simplicity it may be written that the limits of
$r_{12}$ are

$$r_{13} r_{23} \pm \sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}. \tag{5}$$


If $r_{13}$ and $r_{23}$ are both 0, the limits for $r_{12}$ are the customary ±1.
If $r_{13} = r_{23} = k$, formula (5) yields the limits $(2k^2 - 1)$ and 1. E.g.,
for k = .8 the extreme possible values of $r_{12}$ are .28 and 1. The limits
would be 0 and 1 for k = .707. If $r_{13} = k$ and $r_{23} = -k$, the limits
become −1 and $(1 - 2k^2)$; for $r_{13}$ = .8 and $r_{23}$ = −.8 the limits are
−1 and −.28.

If $r_{13} = k$ and $r_{23} = 0$, the limits are $\pm\sqrt{1 - k^2}$. For k = .8
(or −.8) the limits are −.6 and .6.

Despite the indeterminacy of $r_{12.3}$ itself when $r_{13}$ and/or $r_{23}$ take
the value −1 or 1, formulas (4) and (5) apply even then. For example,
if $r_{13} = r_{23} = 1$, $r_{12}$ must be 1. Similarly, if $r_{13} = 1$ and
$r_{23} = -1$, $r_{12}$ must be −1. This accords with one’s intuition. If Brown
is a staunch Democrat, whereas Smith is a staunch Republican, you
can agree politically with one only by disagreeing with the other.
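
The numerical cases quoted above can be checked directly from the limits formula. The short sketch below is illustrative only; the function name `r12_limits` is an assumption made for the example and does not come from the note.

```python
import math

def r12_limits(r13, r23):
    """Return (lower, upper) limits of r12 implied by the limits formula (5)."""
    for r in (r13, r23):
        if not -1.0 <= r <= 1.0:
            raise ValueError("correlations must lie in [-1, 1]")
    centre = r13 * r23
    half_width = math.sqrt((1.0 - r13 ** 2) * (1.0 - r23 ** 2))
    return centre - half_width, centre + half_width

# Cases discussed in the text (values agree up to floating-point rounding).
for pair in [(0.0, 0.0), (0.8, 0.8), (0.8, -0.8), (0.8, 0.0), (1.0, -1.0)]:
    low, high = r12_limits(*pair)
    print(pair, round(low, 2), round(high, 2))
```

When either given correlation equals −1 or 1 the half-width is zero, so the two limits coincide and the value of the third correlation is fixed.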


Maximum Negative Average Intercorrelation


One can generalize the restraints on $r_{12}$ beyond the three-variable
situation. Also, one can show that the maximum negative mean
intercorrelation, such as that for $(r_{12} + r_{13} + r_{23})/3$, cannot attain the
value −1 if there are more than two variables. In fact, for k variables
the maximum possible negative mean r is −1/(k − 1). For k =
2 this is −1. For k = 3 it is −.5. For k = 101 it is −0.01. This
means, for instance, that the 101 items in a test can, at worst,
intercorrelate on the average only slightly less than 0.²
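
A small numerical check of these statements is sketched below. It is an illustration under assumptions, not part of the note: the function names are invented, and the three-variable expression is the lower limit of the mean r given in the accompanying footnote, evaluated over a grid of admissible values.

```python
import math

def min_mean_r(k):
    """Most negative mean intercorrelation attainable among k variables."""
    if k < 2:
        raise ValueError("at least two variables are required")
    return -1.0 / (k - 1)

def mean_r_lower_limit(r13, r23):
    """Lower limit of the mean of r12, r13, r23 for given r13 and r23."""
    root = math.sqrt((1.0 - r13 ** 2) * (1.0 - r23 ** 2))
    return (r13 + r23 + r13 * r23 - root) / 3.0

# Search a grid of (r13, r23) pairs for the most negative attainable mean.
grid = [i / 50.0 for i in range(-50, 51)]
worst = min(mean_r_lower_limit(a, b) for a in grid for b in grid)
print(min_mean_r(2), min_mean_r(3), min_mean_r(101), round(worst, 3))
```

The grid search does not go below the bound for k = 3, which is reached when all three correlations equal −.5.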

Thus, it is well to keep in mind that the ±1 limits for two-
variable (i.e., zero-order) r's do not necessarily hold when all
possible r’s among more than two variables are involved and some
of the r’s are already specified.

² The proof proceeds simply, for k = 3, by adding $r_{13} + r_{23}$ to each part of
inequality (4) and then dividing each part by 3. This leaves the mean r in the
middle, and

$$\frac{r_{13} + r_{23} + r_{13} r_{23} - \sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}}{3}$$

as the lower limit. When $r_{13} = r_{23} = -.5$ (and hence the minimum $r_{12}$ also
is −.5) this lower limit is −.5, as shown above, and this is the largest
possible negative value. When $r_{13} = r_{23} = -1$ (and hence $r_{12} = 1$), the mean r
is −1/3, which is less negative than −.5. For a general proof see Stanley and
[second author and year illegible in the scan], p. 63.
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METHOD EFFECTS IN JUDGING THE 
DESIRABILITY OF TRAITS 


WILLEM K. B. HOFSTEE 
University of Groningen, The Netherlands 


Desirability judgments of trait-descriptive items have been
collected by several investigators for a variety of reasons. One 
such aim is to obtain item indices to be used in the construction 
of forced-choice tests (e.g. Bartlett, Quay, and Wrightsman, 1960; 
Denton, 1954; Dunnette and Kirchner, 1960; Edwards, 1954; 
Ghiselli, 1954; Gordon, 1953; Heineman, 1953; Waters and Wherry, 
1962), in cross-cultural studies (e.g. Edwards, 1957; Klett and 
Yaukey, 1959; Lovaas, 1958), and in studies comparing social 
desirability mean ratings with other item indices, in particular 
frequency of endorsement (e.g. Cruse, 1965; Edwards, 1953; 1966). 
Another goal is the study of individual differences; a number of
investigators (Feldman and Corah, 1960; Heilbrun and Goodstein, 
1961a; 1961b; Jackson, 1964; Klett, 1957; Kogan, 1962; Messick, 
1960; Saltz, Reece, and Ager, 1962; Scott, 1963; Wiggins, 1966) 
have called attention to the multidimensionality of the social 
desirability concept, or alternatively, to systematic individual 
differences in desirability judgments. In view of the diverse and 
widespread use that is made of these judgments, it is of interest 
to know whether sources other than differences between items and 
between individuals, together with their interaction, are influential. 
The present study represents an extensive search for method vari- 
ance in desirability judgments of traits. Specifically, position and 


land, where the data were processed. Dr. L. R. Goldberg deserves much thanks
for his valuable comments on a draft of this paper.
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sequence effects as determinants of item desirability, and response 
sets as determinants of individual differences, were investigated.

Position effects, in this context, are systematic changes in the 
judgment of items over time, as a function of experience with
similar items, rather than of learning outside the testing situation. 
Position effects have been extensively studied in psychophysics 
(Helson, 1964; Johnson, 1955); in the field of testing, which 
provided the fertile ground for a rapidly growing interest in 
desirability judgments, they have not been given much attention. 

One important exception is a study by Gordon (1952), who 
investigated the position effect in self-description. Gordon ad- 
ministered 150 personality items in five booklets of 30 items; the 
order in which the booklets were presented was varied. Subjects
(Ss) indicated the extent to which an item applied to them on a
five-point scale. It was found that desirable traits were judged more 
applicable, and undesirable items less so, as the list continued. In 
view of this finding, it is reasonable to suspect the existence of 
position effects in desirability judgments of personality items similar
to those used by Gordon. Accordingly, a counterbalancing
design with respect to position was applied in the present study to 
see whether Gordon’s results could be extended to another domain. 

Sequence effects are defined here as biases due to the particular
sequence in which stimuli appear; if a trait is preceded, for example,
by an extremely desirable item, it is conceivable that the trait
would be judged relatively desirable or relatively undesirable
through some assimilation or contrast effect. Such a sequence
effect was found to be operative in a study by Hofstee (1966),
again in a self-description task. The study tried to explain why
certain alternatives in a forced-choice test were endorsed more often
than other alternatives, even though the traits were matched for
popularity on the basis of absolute judgments. It turned out that
the popularity index represented an overestimation if the trait
was preceded by a more popular trait in the single-item form,
and an underestimation if the preceding trait was less popular;
this means that assimilation took place in the absolute judgments.
For a more direct study of sequence effects, it is necessary to
vary item sequence along with position; this requires a more
intricate design than Gordon’s (see above), who had most items
preceded by the same item throughout.
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Response sets. At first sight, it may seem unwarranted to look 
for response sets in desirability ratings (except for social de- 
sirability set, which is what Ss are asked for). Yet in a formal 
sense, such tendencies as acquiescence and extreme response set, 
which have become more or less associated with self-report, might 
have counterparts in the judgment of item desirability. If present, 
sets should be detectable in the correlational or factorial structure 
of the judgments. 


Method 
Items 


Trait-descriptive adjectives (together with some nouns and 
short expressions) were used as items. A list of 260 items, collected 
for general purposes at the Psychology Department of the Univer- 
sity of Groningen in Holland, provided most of the material; 
80 traits were eliminated because of verbal difficulty, and 16 were 
added to obtain 196, the square of an even number required by 
the research design. The final list has been presented in a pre- 
liminary report (Hofstee, 1967). 


Procedure 


From the total set of adjectives, 14 versions were printed with 
the same items in different orders. The design for rotating the 
items was based upon a latin square, presented in Table 1. In 


TABLE 1
Counterbalancing Design for Rotating Position and Sequence of 196 Items

[Body of the table: a 14 × 14 latin square with Versions I to XIV as rows
and Positions 1 to 14 as columns. The cell entries are not legibly
recoverable from the source scan.]
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order to vary position as well as sequence, the square was used 
in a nested (or quadratic) fashion: a number stands for a block 
of 14 traits, and also for one particular trait. The 14 blocks of 
items were rotated according to the design, and so were the 14 
items within each block. Thus the item numbered 101 in version 
I, for example, being the third item in the eighth block (7 × 14
+ 3 = 101), appears in version XI as the 13th item (because
the “3” stands in the 13th position) in the third block (because the 
“8” stands in the third position), which gives the position number 
of 41. This design has the following properties. (a) The average 
position is equal for all items: (1 + 196)/2 = 98.5, since 
versions I and XIV, II and XIII, and so on, are mirror images 
with respect to order. (b) The 14 positions occupied by an item 
in the different versions are about equally spread for the different 
items: in particular, each item occurs once among the first 14, once 
in the second block, and so on. (c) A particular item is never 
preceded more than once by the same item: it is preceded once by 
each member of its block, and once by an item from another 
block (unless it appears first in a version). 
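
The nested use of the square can be illustrated with a short sketch. It rests on assumptions: the function name is invented, and because the rows of the square are not available here, the row used for version XI is hypothetical except for the two facts stated in the text (the "8" stands in the third position and the "3" in the 13th).

```python
def item_position(item, version_row, n=14):
    """Position (1-based) of an item in a version under the nested square.

    `version_row` lists the labels 1..n in the order in which they appear in
    that version; the same row orders the blocks and the items within a block.
    """
    block_index, within_index = divmod(item - 1, n)
    block_label = block_index + 1      # block holding the item in version I
    item_label = within_index + 1      # rank of the item inside that block
    block_pos = version_row.index(block_label)   # 0-based place of the block
    item_pos = version_row.index(item_label)     # 0-based place inside the block
    return block_pos * n + item_pos + 1

version_i = list(range(1, 15))   # natural order
# Hypothetical row: only "8 in position 3" and "3 in position 13" come from the text.
hypothetical_xi = [1, 2, 8, 4, 5, 6, 7, 9, 10, 11, 12, 13, 3, 14]
mirror = hypothetical_xi[::-1]   # a version in the reverse order

print(item_position(101, version_i), item_position(101, hypothetical_xi))
```

Under this rule a version and its mirror image place every item at positions that sum to 197, which is why the average position of each item over such a pair is 98.5.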


Subjects 


Young men who were being screened for fitness at the Military
Induction Centers of the Dutch Forces served as Ss. The majority of
testees at the Centers were 19 years old. Since military service is
compulsory in the Netherlands, Ss were approximately representative
of the national age group. The 14 versions were distributed at
random among Ss. The lists were administered as a test among
other tests by the regular personnel. Each version was filled out by
some 900 Ss, for a total of a little over 12,500 Ss.


Instructions 


Ss were instructed to indicate for each trait whether, in general,
it was a very desirable, desirable, neutral, undesirable, or very
undesirable trait. It should be noted that social rather than personal
desirability was emphasized in these instructions; the “objective”
slant was even accentuated by the heading of the lists,
which read “Test of Judgment.” Still, it would not be
unreasonable to expect personal conceptions of social desirability
to show up in the responses.

[Figure 1 not reproduced. Horizontal axis: desirability scale value
(ticks from 1.6 to 4.4); vertical axis: frequency.]

Figure 1. Distribution of desirability indices of 196 traits.
Results 


Responses were treated as ratings on an interval scale ranging 
from one (very desirable) to five (very undesirable). Version 
means and standard deviations of each item, taken over 900 Ss 
at a time (each item has 14 such Version means), were computed 
as well as the grand mean and standard deviation of each item, 
taken over all Ss (and all Versions). 

The distribution of the 196 grand means is given in Figure 1,
which shows the bimodality that is characteristic of any relatively 
unselected list of traits (cf. Cruse, 1965; Edwards, 1966). 

As was anticipated, the 14 Version means for any particular 
trait showed considerable fluctuations around its grand mean. The 
observed standard deviation of the Version means ranged from 
.04 to .15 for the different traits, with a median SD of .09;
fluctuations were largest for the neutral traits, and smallest for 
the undesirable traits. The figure of .09 may be contrasted with 
the expected SD of a (grand) mean, when random samples of 
some 900 observations are taken. Since the standard deviations of 
ratings for a particular trait (grand SD’s) ranged from .64 to 
1.34, the expected SD of the mean ranges from .02 to .04, values
which are roughly one third the size of the observed SD's.
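
The expected values follow from the usual formula for the standard error of a mean, the SD divided by the square root of the sample size. A minimal sketch, with a function name chosen for the example and the extreme grand SDs taken as .64 and 1.34:

```python
import math

def se_of_mean(sd, n):
    """Standard error of a mean based on n independent observations."""
    return sd / math.sqrt(n)

n_per_version = 900
print(round(se_of_mean(0.64, n_per_version), 2))   # smallest grand SD
print(round(se_of_mean(1.34, n_per_version), 2))   # largest grand SD
```

With some 900 Ss per version the divisor is 30, so even the most variable trait would be expected to fluctuate far less than the median observed SD of .09.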

The deviations of the Version means from the grand mean of a 
trait were found to be highly predictable on the basis of position 
and sequence effects. 

Position 

For each trait, the Version means were correlated with the
positions (rank numbers of the positions in which the trait
appeared). It was found that the over-all desirability (grand mean)
was a distinct moderator (Saunders, 1956) of the position effect:
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correlations were positive for the desirable traits, negative for the 
most undesirable traits, and close to zero for the neutral and
slightly undesirable traits. This means that desirable traits were 
rated less desirable and undesirable traits less undesirable as 
the list continued. The correlation between trait desirability (grand 
mean) and position effect (as measured by the correlation between 
positions and Version means) was —.77; no departure from linearity 
was observed in the scatter plot. The regression line for estimating 
the position effect from over-all desirability is the straight line in 
Figure 2. The other lines in Figure 2 are explained below. 

For further illustration, regression lines of Version mean on 
position, averaged for 14 traits at a time—the 14 most desirable 
traits, the next 14, and so on—are presented in Figure 3. It can 
be seen that the position effect is quite marked, especially for 
extreme traits: the 14 most desirable traits, for example, were 
judged less desirable on the average at the end of the list than 
the next 14 at the beginning. Failure to control for position effect 
could thus easily lead to faulty conceptions of the relative de- 
sirability of a trait. 


[Figure 2 not reproduced. Horizontal axis: grand mean scale value (M).
Curve labels partly legible in the scan: r = −.50M + 1.76 (straight line)
and 1.77M − .31M² − 1.77 (curve).]

Figure 2. Position effect, sequence effect, and joint effect in desirability
ratings as a function of trait desirability.

[Figure 3 not reproduced.]

Figure 3. Average position effect for 14 groups of 14 items (lines are
least-squares linear fits).

Sequence

For 187 of the 196 items, the correlation between the grand
mean of the preceding item and the Version means of the item
itself, taken over the versions, was positive. This implies that the
sequence effect was one of assimilation: the more desirable the
preceding item, the more desirable the item under consideration,
and the more undesirable the preceding item, the more undesirable.
Again, the over-all desirability of the traits moderated the sequence
effect: correlations were highest for neutral traits, and lowest for
undesirable items. The regression line for estimating the sequence
effect from over-all desirability is the lower curve in Figure 2;
the corresponding eta coefficient, taken over 10 classes with equal
intervals, was .48.

Since position and grand mean desirability of the preceding item
were essentially uncorrelated (because of the random order in which
the traits were presented), the multiple correlation predicting Version
means from position and sequence may be estimated as
$R_{\text{mult}} = \sqrt{\hat{r}_{\text{pos}}^2 + \hat{r}_{\text{seq}}^2}$; estimated rather than observed first-order
correlations were used because the latter capitalize on error. This
function is depicted in the upper curve in Figure 2. Clearly, the
fluctuations shown by the Version means were least predictable
for negative traits; however, this is where the observed fluctuations
tended to be smallest. On the average, almost 2/3 of the variance
between Version means was predicted by the position and sequence
effects; the remainder was probably due to sampling error. Several
other hypotheses were investigated, like beginning and ending effects
over and above the linear position effect, inverse sequence effect
and second-order sequence effect (influence of the following trait
and the second preceding trait, respectively), and qualitative context
effects; for none of these could other than marginal support be
found (for full details, see Hofstee, 1967, p. 17ff.).
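
The estimate of the multiple correlation relies on the two predictors being uncorrelated. The sketch below uses the standard two-predictor formula, which reduces to the square root of the sum of squared correlations when the predictors correlate zero; the function name and the example values are assumptions made for illustration and are not taken from the study.

```python
import math

def multiple_r(r1, r2, r12=0.0):
    """Multiple correlation of a criterion with two predictors.

    r1 and r2 are the criterion correlations of the predictors and r12 is
    the correlation between the predictors (0 when they are uncorrelated).
    """
    r_squared = (r1 ** 2 + r2 ** 2 - 2.0 * r1 * r2 * r12) / (1.0 - r12 ** 2)
    return math.sqrt(r_squared)

# Hypothetical position and sequence correlations for one trait.
r_position, r_sequence = 0.6, 0.5
R = multiple_r(r_position, r_sequence)
print(round(R, 3), round(R ** 2, 3))   # multiple R and share of variance predicted
```

The squared value is the proportion of variance between Version means that the two effects predict jointly for that trait.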


Response set 


In order to investigate whether response sets were affecting
these judgments, a sample of 40 items was selected by taking every
fifth item (starting with the first) of the 196 ranked according to
grand mean desirability. The responses to these 40 items were
intercorrelated, and the resulting matrix was subjected to a canonical
factor analysis. Before looking at the results, an attempt will be
made to predict what should happen under various assumptions.

1. Insofar as Ss are manifesting no consistent response sets, that
is if all Ss assign the same true value to a trait, their responses
would differ only as a function of random error, and consequently 
correlations between the traits should be close to zero. The cor- 
relation matrix between a number of weights, for example, all 
measured by a number of scales that are inaccurate but free from 
systematic error, should be of this type. 

2. Insofar as Ss differ in leniency, that is if some Ss have a 
tendency to use the “desirable” end of the scale, while others 
tend towards the “undesirable” end, correlations should all be 
positive, giving rise to a general factor on which all items have 
loadings of the same sign. 

3. If Ss differ in extremeness response set, this would not in 
itself influence the correlational or factorial structure, since no 
direction is involved. However, if some Ss have a tendency to rate
desirable traits extremely desirable, and undesirable traits extremely 
undesirable (as opposed to a “pure” extremeness response set), 
and other Ss less so, then correlations between two desirable 
traits and between undesirable traits should be positive, while 
correlations between a desirable trait and an undesirable trait 
should be negative. A bipolar factor, carrying loadings that are 
highly correlated with trait desirability, should account for the 
structure of the relationships. 

4. If different viewpoints are operative, that is if some Ss have
a relative preference for certain kinds of traits, other Ss for other
traits, some correlations should be positive (e.g. between “silent”
and “uncommunicative”), others negative (e.g. between “silent”
and “talkative”); one or more factors should be found that can be
meaningfully interpreted on the basis of item content.
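
The first three predictions can be illustrated with a small simulation. It is a sketch under assumptions that are not part of the study: ratings are treated as continuous, leniency is modelled as a constant added to all of a subject's ratings, the extreme response set as a subject-specific stretching of the distance between a trait's true value and the scale midpoint of 3, and all names and parameter values are invented for the example.

```python
import math
import random

def pearson(x, y):
    """Product-moment correlation between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def simulate(model, true_values, n_subjects=2000, seed=0):
    """Return one list of ratings per trait under the named model."""
    rng = random.Random(seed)
    ratings = [[] for _ in true_values]
    for _ in range(n_subjects):
        leniency = rng.gauss(0.0, 0.5) if model == "leniency" else 0.0
        stretch = rng.gauss(1.0, 0.4) if model == "extremeness" else 1.0
        for j, t in enumerate(true_values):
            error = rng.gauss(0.0, 0.5)          # independent random error
            ratings[j].append(3.0 + stretch * (t - 3.0) + leniency + error)
    return ratings

traits = [1.8, 2.0, 4.0]   # two desirable traits and one undesirable trait
for model in ("none", "leniency", "extremeness"):
    r = simulate(model, traits)
    print(model, round(pearson(r[0], r[1]), 2), round(pearson(r[0], r[2]), 2))
```

Random error alone gives correlations near zero, leniency gives positive correlations throughout, and the directional extreme set gives positive correlations within each pole and negative correlations across poles.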

The 40 traits are presented in Table 2, together with grand 
means and factor loadings. The adjectives are English approxi- 
mations of the Dutch traits; they should be read against the 
background of the desirability indices, since it appeared impossible 
to find translations representing both qualitative meaning and 
desirability of the original in a number of cases. The canonical 
factor analysis yielded a large number of significant factors; seven 
factors are reported here since no factor beyond the seventh 
showed more than one loading over .20. 

TABLE 2
Desirability Indices and Factor Loadings (× 100) of 40 Traits

    Trait                  M     I    II  III'   IV'    V'   VI'  VII'    h²
 1. Trustworthy         1.63    48    22   -01    01   -02    03   -06     ?
 2. Self-reliant        1.78    54    23    12   -03   -02    05    00    39
 3. Friendly            1.82    43    28   -26   -12   -08    06   -06    36
 4. Neat                1.8?    39    28   -16   -01   -07    27   -00     ?
 5. Determined          1.89    56    23    16   -10    05   -10    10     ?
 6. Likable             1.92    57    22    06    01   -11   -13   -06    41
 7. Controlled          1.97    53    25    07    14   -04    15   -01     ?
 8. Enterprising        1.99    55    17    18   -06   -00   -13    08     ?
 9. Contented           2.02    34    28   -27   -05   -03    06   -00     ?
10. Frank               2.05    43    24   -05    02   -18   -02   -04     ?
11. Quiet               2.07    39    31   -09    14    10     ?     ?     ?
12. Merry               2.15    40    28   -12   -01   -08   -13    03     ?
13. Self-assured        2.18    32    26    20   -16   -03    05    02     ?
14. Sober                  ?    33    26    08    09    07     ?     ?     ?
15. Tenacious           2.24    35    25    09   -12    15   -13    19     ?
16. Composed            2.32    36    29   -02    21    04    10    04     ?
17. Domesticated        2.36    19    22   -18    06    00    06    02     ?
18. Stable              2.48    36    17    15    18    07   -26    10     ?
19. Generous            2.54    08    20   -04   -09   -06   -27    06     ?
20. Indulgent           2.70    03    26   -26   -02   -08   -08   -02     ?
21. Silent              2.90   -05    30    01    12    35    04     ?     ?
22. Refined (-)         3.08   -09    22    14   -32   -03    19    00     ?
23. Uncommunicative     3.14   -14    27    07    05    35   -02     ?     ?
24. Sturdy (-)          3.25   -07    22    07   -19    00   -09     ?     ?
25. Moody               3.36   -30    23   -06   -17   -07    25     ?     ?
26. Strange             3.42   -34    32    11    06    00   -03   -21     ?
27. Odd                 3.50   -32    32    11    09   -02   -03   -20     ?
28. Elderly             3.58   -30    27   -01    02    03    08     ?     ?
29. Imbalanced          3.56   -38    16   -09    07    07   -12    06     ?
30. Confused            3.68   -47    29   -04   -02    02   -08   -20    37
31. Irresolute          3.70   -51    26   -18   -02    00    06     ?     ?
32. Touchy              3.76   -43    23    03   -05    08   -15   -11     ?
33. Melancholic         3.79   -46    33   -02   -01   -00    03   -17    37
34. Uncontrolled        3.86   -54    20   -07   -16    02   -01     ?    40
35. Pushy               3.87   -50    24    04   -21   -03   -21     ?     ?
36. Unfriendly          3.90   -54    20    12    08   -01   -03    18     ?
37. Slavish             4.05   -49    26    00    05   -12     ?     ?     ?
38. Meddler             4.08   -52    23    09   -03   -09   -01    17     ?
39. Sluggish            4.15   -50    25    08    22   -09     ?     ?     ?
40. Cowardly            4.38   -58    19    07    15   -10     ?    21     ?
    Per cent of
    common variance             57    21    05    05    04    05    03 11.78

Note. ? marks an entry that is illegible in the source scan; a few entries
in columns VI' and VII' are restored from the factor listings in the text.

Clearly, the first factor represents “extreme responding to social
desirability” as defined above (see 3); the second, leniency (see
2). Together, these factors account for 78 per cent of the common
variance. The other factors appear to be content-determined,
representing points of view (see 4), the contribution of each of which
is small. Still, it is important that viewpoint variance does show
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up even under social rather than personal desirability instructions. 
The loadings were obtained through a Varimax rotation of factors 
III to VII, leaving the factors I and II intact. Most of the factors 
are clearly interpretable, even though the loadings are small. 


Factor III’ carries the following traits: 


13. Self-assured .20
8. Enterprising .18
5. Determined .16
versus:
9. Contented −.27
3. Friendly −.26
20. Indulgent −.26
17. Domesticated −.18
31. Irresolute −.18
4. Neat −.16


It could be named “preference for assertive traits" or "masculine 
stereotype.” 


Factor IV’ shows the traits: 


39. Sluggish .22
16. Composed .21
18. Stable .18
versus:
22. Refined (−) −.32
35. Pushy −.21
24. Sturdy (−) −.19
25. Moody −.17
34. Uncontrolled −.16
13. Self-assured −.16


It could be interpreted as “preference for passive as opposed to 
acting-out behavior.” 
Factor V’ carries Silent (.35) and Uncommunicative (.35) and, 


on the negative side, Frank (−.18). Factor VI' is hard to interpret:


4. Neat .27
25. Moody .25
22. Refined .19
versus:
19. Generous −.27
18. Stable −.26
35. Pushy −.21


The general idea is perhaps “femininity.”
Factor VII' seems to be most clearly defined by its negative
pole:


40. Cowardly .21
15. Tenacious .19
36. Unfriendly .18
38. Meddler .17
versus:
26. Strange −.21
27. Odd −.20
30. Confused −.20
33. Melancholic −.17


“Strange as opposed to straightforward behavior” is probably the
best interpretation. 

Since none of the viewpoint factors accounts for as much as
2 per cent of the total variance, the safest general interpretation
is probably that the viewpoints exist, but are shared by only a
minority of the subjects.

The correlation matrix from which the factors were derived is
shown in Table 3.

When the intercorrelations are averaged eight variables at a
time, e.g. 1-8, 9-16, ..., 33-40, the basic two-factor
structure becomes evident.

Discussion

On the basis of these results, it is clear that method effects
loom large in desirability ratings of the kind used in this study.
Before discussing the implications of the findings, an
attempt should be made to clarify the nature of the phenomena
that were observed.

That the position effect should be one of convergence towards
the midpoint is interesting enough: Gordon (1952) found precisely
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the opposite effect (divergence). The one clear difference between 
the studies is in the instructions: in Gordon's research, Ss judged 
the extent to which a trait applied to themselves. Two other 
studies, where context effects rather than position effects were 
under investigation, show a highly similar discrepancy. Cowen 
and Stiller (1959) found that desirable traits were rated relatively 
undesirable when placed in a context of desirable items only, while 
neutral traits were rated relatively desirable in an exclusively 
neutral context. A study by Young, Holtzman, and Bryant (1954), 
on the other hand, gave almost exactly the opposite results with 
respect to the applicability of a trait. In their research, the extent 
to which an item applied to a peer was rated. Positive traits were 
rated extremely applicable when the list consisted of positive 
traits only, and negative traits were rejected more extremely when 
the context was negative. 

The analogy made by these four studies is worth considering, 
especially because the results in the latter two were interpreted 
by the authors independently in terms of Adaptation Level theory 
(Helson, 1947). If only positive traits are being presented, AL 
(the stimulus value that leads to a neutral reaction) becomes 
positive, so any particular positive trait will be rated less positive 
than in a mixed context, since the difference between it and AL is 
relatively small. If, on the other hand, people (or the Self) are 
being judged, Ss may apply positive traits more readily, because 
these traits are no longer as extremely positive as they could be. 
More formally, in this model the person that is being rated is 
apparently thought to have a neutral or mildly positive (at least 
not extremely positive) position on the scale; as the desirability 
of traits decreases towards this point through adaptation, appli- 
cability increases. 

This is a plausible explanation of what happens to desirable 
traits, but it fails to account for the finding that undesirable 
traits are judged less applicable when placed in a context of un- 
desirable items. Only if the person to be rated were at the very 
negative end of the scale, could applicability of negative traits 
decrease as a consequence of an adaptation that takes off the 
sharp edges. Apparently this implication was neglected by Young, 
Holtzman and Bryant (1954). 


TABLE 3
Correlation Matrix of Desirability I[...] (Decimal Points Omitted)

[Lower-triangular matrix of intercorrelations among the 40 traits of
Table 2. The entries are not legibly recoverable from the source scan.]

While the position effect as observed in the present study could
readily be explained by AL theorists as just another example of
the central tendency effect that is expected when anchors are
absent (Helson, 1947), such an interpretation is clearly limited
because it fails to account for other phenomena that seem
highly related.

An alternative hypothesis is that ratings of items and ratings of
persons (Self included) tend to be mixed up by Ss in the long
run. Basic to the hypothesis is the assumption that desirability


indices of traits should have a flatter distribution than applicability 
indices of the same traits. If such is the case, then item indices
should move away from the midpoint as Ss tend to forget about 
the applicability instructions and gradually shift towards respond- 
ing in terms of trait desirability; and regression towards the 
midpoint should be found as “subjective” considerations gradually 
enter into judgments of trait desirability. 


There is a certain logic to this “mixture” hypothesis, unusual
though it may be. The suggestion that in self-ratings Ss tend to
move away from the instructions and pay more and more attention 
to trait desirability is not at all new; what is held by the hypothesis 
is that this is only one side of the coin. In everyday judgment, 
nothing may be more common than mixing up objective (stimulus- 
directed) and subjective (self-, person-directed) judgments; and 
Ss may be incapable in the long run of fulfilling the psychologist’s
demand to separate the two. The hypothesis states that Ss start 
out obeying the instructions more or less carefully, but end up 
responding in their own favorite manner. Note that the concept of 
social desirability set is placed in a more general perspective by 
this reasoning: it is supposed to be one manifestation of a tendency 
to mix up subjective and objective ways of judging. 

One way to falsify the mixture hypothesis is to disprove the 
basic assumption that desirability indices are distributed more flatly 
than applicability indices. The reasoning behind the assumption
is as follows: the variance of item indices is a function of the 
variances of ratings assigned by the individual Ss, and of the 
intercorrelations between Ss over items. It seems plausible that the 
correlations between Ss should be higher when desirability is
rated—a trait may or may not apply, but there is a fair amount 
of consensus as to whether it is favorable or not—and with regard 
to individual variances, it may be argued that these should also 
be higher in the desirability case, which is to say that the cor- 
relation between stimulus scale and response scale (Johnson, 1952) 
should be higher when desirability is rated. All these assumptions,
however, can be tested directly. Another way of disproving the
hypothesis is to show that, at a certain point in time, applicability
indices get distributed even more flatly than desirability indices
of the same traits: such a finding would contradict the idea of a
common limit (Ss’ favorite way of responding) that is vital to the
hypothesis. However, no data are presently available to the
author to carry out these tests. 
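
The dependence of item-index variance on individual variances and on the intercorrelations between Ss can be checked with a small numerical sketch. It assumes, which the text does not state explicitly, that an item index is the mean rating over Ss; the function names and the toy ratings are hypothetical.

```python
from math import sqrt
from statistics import fmean, pvariance

def pearson(x, y):
    """Product-moment correlation between two Ss over items."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def index_variance(ratings):
    """Variance of item indices rebuilt from its parts.

    ratings[s][i] is the rating given by subject s to item i.
    The result is (1 / n^2) * sum over all pairs (s, t) of
    r_st * sd_s * sd_t, with r_ss = 1.
    """
    n = len(ratings)
    sds = [sqrt(pvariance(row)) for row in ratings]
    total = 0.0
    for s in range(n):
        for t in range(n):
            r = 1.0 if s == t else pearson(ratings[s], ratings[t])
            total += r * sds[s] * sds[t]
    return total / n ** 2

# Hypothetical ratings: 3 Ss, 5 items.
ratings = [
    [1, 2, 4, 5, 3],
    [2, 2, 5, 4, 3],
    [1, 3, 3, 5, 4],
]
item_means = [fmean(col) for col in zip(*ratings)]
direct = pvariance(item_means)          # variance of the item indices
from_parts = index_variance(ratings)    # same quantity, from SDs and r's
```

Under this assumption, larger individual variances or higher agreement between Ss both raise the variance of the indices, which is the sense in which desirability indices are expected to be distributed more flatly.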

To appreciate the sequence effect found in this study, certain
interpretations should be ruled out that may seem adequate at
first glance. (a) The contamination of desirability indices by that
of the foregoing item is not an inertia effect (laziness on the part
of S to move his pencil from, say, the “very desirable” to the
“very undesirable” position); inertia cannot explain why the neutral
traits were most susceptible to the sequence effect. The median
(which minimizes the sum of absolute distances, see Chernoff and 
Moses, 1959, p. 319 ff.) of the distribution given in Figure 1 is
at 2.71; the mean, which minimizes the sum of squared distances, 
is at 2.87. If inertia were operative, the sequence effect should 
be minimal in the neutral region because it is reasonable to 
hypothesize that the smaller the average distance between the 
preceding traits and the trait itself, the less inertia should be in- 
fluential. The curve representing the sequence effect in Figure 2, 
however, has a maximum at 2.85. (b) AL theory would predict a 
contrast rather than the assimilation effect found in the data: an 
undesirable trait should make the next one appear more desirable, 
etc. Only if anchoring or self-anchoring may be assumed to take
place, as in Parducci and Marshall's (1962) experiments, could 
the sequence effect be explained in terms of contrast by AL theory; 
in the present case, however, such a construction seems remote. 
Moreover, AL theory tends to give equal weights to all preceding 
stimuli, which is certainly inadequate here, since even the second 
preceding trait was not influential. (c) Cognitive theories stating
that the meaning (and, more or less by consequence, the desirability)
of a trait is inferred from its context (e.g. Asch, 1946; Shapiro
and Tagiuri, 1958), would appear too powerful here, viewing the 
absence of qualitative context effects. 
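
The remark under (a) that the median minimizes the sum of absolute distances, while the mean minimizes the sum of squared distances, can be verified by brute force. The sketch below uses hypothetical scale values, not the distribution of Figure 1, and searches a grid of candidate centers.

```python
from statistics import mean, median

def sum_abs(values, center):
    """Sum of absolute distances from a candidate center."""
    return sum(abs(v - center) for v in values)

def sum_sq(values, center):
    """Sum of squared distances from a candidate center."""
    return sum((v - center) ** 2 for v in values)

# Hypothetical desirability values on a five-point scale.
values = [1.2, 2.1, 2.6, 2.7, 2.9, 3.4, 4.8]

# Candidate centers 1.00, 1.01, ..., 5.00.
grid = [i / 100 for i in range(100, 501)]
best_abs = min(grid, key=lambda c: sum_abs(values, c))
best_sq = min(grid, key=lambda c: sum_sq(values, c))

closest_to_median = abs(best_abs - median(values))
closest_to_mean = abs(best_sq - mean(values))
```

The grid minimizer of the absolute criterion falls on the median, and the grid minimizer of the squared criterion falls within one grid step of the mean.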

What takes place in these rapid successions of judgments is 
probably on a more automatic level: some very short-lived 
carry-over of atmospheric properties from one item to the next. 
Each item seems to create an atmosphere of “goodness” or “badness,”
in varying degrees. This atmosphere is quickly neutralized
by the next item, but in the meantime it has been effective in
making that item appear a little more desirable or undesirable.
That the neutral items are most sensitive to the sequence effect
may well be because of their relative freedom to move along the
scale.

As to the generality of the sequence effect, the data are somewhat 
divergent. A bias quite similar to the present effect was found in 
self-ratings (Hofstee, 1966). In an experiment on visual thresholds,
Verplanck, Collier, and Cotton (1952), using single stimuli at the
50 per cent discrimination level, found that response sequences
were clearly redundant. While such findings might point to a high
degree of generality, the effect is probably weak enough to remain
subliminal in some cases. Weiss and Moos (1965), for example,
found no evidence of sequential redundancy (except for first-order
preponderance of one answer category) in the MMPI; another negative
instance was provided by Young, Holtzman, and Bryant (1964),
who found sequence effects of no importance in peer ratings on a
list of 180 traits.

Of the response sets observed in the data, the extremeness
response tendency (Factor I) especially deserves further attention.
A structure showing positive relationships among desirable traits
and among undesirable traits, and negative correlations between
desirable and undesirable traits, could not result from a pure
extremeness response style, simply because such a tendency does
not influence linear relationships. On the other hand, it makes very
little sense to interpret the corresponding bipolar factor in terms of
desirability only, e.g. as “preference for desirable traits vs. preference
for undesirable traits.” The proper interpretation derives
from a fusion between the concepts of desirability and extremeness
response set, e.g. “(differentially) extreme responding to trait
desirability.” Against the background of the discussion of the
position effect, where extreme responding was associated with an
objective (stimulus-directed) as opposed to subjective (self-directed)
attitude, it might be further argued that persons scoring high
on the first factor are those who answer predominantly in terms
of trait desirability, while persons scoring low are those who
answer predominantly in terms of trait applicability.

While the theoretical status of the method effects as found
in this study is by no means completely determined, practical
implications should already be evident. In the absence of counterbalancing
designs, relative desirability indices of traits are biased,
and sometimes appreciably so: a desirable trait placed at the end
of a list just after a very undesirable trait may be rated less
desirable than another which occurs in a “desirable” position,
even though the difference between the “true” values is a
point on the scale in the opposite direction. In the construction
of forced-choice tests, these biases could, in themselves, be responsible
for the lack of balance between alternatives which has frequently
been observed (Edwards, Wright, and Lunneborg, ...;
Feldman and Corah, 1960; Hofstee, 1966); Gordon’s (e.g. ...)
tests are among the few in which position effects were controlled
in the construction phase, while in no case have sequence effects
been ruled out.

In studies where desirability indices are correlated with frequency 
of endorsement, the irony of the circumstances is that the two 
effects may cancel each other to some extent. The position effect 
should lower the correlation, because it works in opposite directions 
for the two instructions, while the sequence effect enhances it, 
because it works in the same direction. On the other hand, however, 
if desirability indices obtained from different groups (e.g. male 
and female, patient and nonpatient, American and British) are
correlated, the observed relationship represents an overestimate, 
because the two effects operate jointly to inflate the correlation. Just 
how serious this is may depend upon the range of traits studied, 
among other factors. 

Finally, the abundance of response set variance in the data 
should make one doubtful of arguments like Jackson’s (1964, 
p. 235) that desirability ratings are relatively free from systematic 
response biases and may therefore be used as personality tests. 
Given individual differences in social desirability, it seems pref- 
erable to measure these viewpoints by means of forced-choice 
formats, which are free from response sets—other than personal 
desirability. While the forced-choice format precludes the appear- 
ance of response styles like acquiescence and extremeness set, 
forced-choice tests have consistently been found to be sensitive 
to personal conceptions of the desirable in self-description tasks. 
In the explicit measurement of personal desirability, however, this 
sensitivity would seem to be advantageous rather than harmful. 
Thus turning the forced-choice format’s most predominant weak- 
ness into a strength is a line of research that seems worth pursuing. 


Summary 


Three kinds of method effects in desirability ratings of trait- 
descriptive adjectives were studied by counterbalancing the posi- 
tion of items in the list as well as the sequence in which items 
followed one another. A position effect was clearly present: desirable 
items were rated progressively less desirable, and very undesirable 
items less undesirable. A sequence effect was demonstrated: items 
were rated more desirable as the preceding item was more desirable.
Response sets were found to be influential in determining
individual differences in the conception of desirability, especially
a variant of extremeness response set. Several explanations for
these findings, such as adaptation, were ruled out, and new hypotheses
were presented. Some practical implications of the findings
for studies using desirability ratings were discussed.


ITEM FACTOR STRUCTURE OF
THE ADJECTIVE CHECK LIST


GEORGE V. C. PARKER AND DONALD J. VELDMAN
The University of Texas at Austin 


As one of the frequently used instruments for assessment of 
self-concept through self-description, the Adjective Check List
(ACL) has recognized merit in comparison with other similar 
instruments. Existing scales, as developed by Gough and Heilbrun 
(1965), provide data regarding behavioral tendencies that may be 
useful for research as well as for diagnostic and counseling
purposes (e.g., Heilbrun, 1960, 1961; MacKinnon, 1963; Parker,
1967).

It is becoming clear in psychological research that a thorough
investigation of dimensions of test behavior related to such vari- 
ables as the self-concept is essential (Foa, 1961; Loehlin, 1961; 
Briar and Bieri, 1963; Scarr, 1966; Parker and Megargee, 1967). 
Interest has not been lacking, but empirical analysis of the item 
factor structure of most commonly used personality measurement 
instruments has not been accomplished because limitations of 
existing computation equipment have made the task prohibitive. 
Consequently, beyond a few studies such as Comrey’s item factor 
analyses of subscales of the Minnesota Multiphasic Personality
Inventory (e.g., Comrey, 1957, 1958), other work with instruments
containing many items has focused mainly on analyses of
the factor structure of subscale scores themselves (cf. Scarr,
1966; Parker and Megargee, 1967). The purpose of this study
is to examine the factor structure of the 300 ACL items in order 
to help clarify the multidimensional characteristics of this in- 
strument. 


Procedure and Results 


Virtually the entire 1965 freshman class of the University of 
Texas at Austin completed the ACL as part of the regular entrance 
testing program which is administered annually by the University 
Testing and Counseling Center. A total of 5017 students were 
tested: 2212 females (mean age = 18.7) and 2805 males (mean
age = 18.6). The raw data for the ACL were recorded by the
students on standard IBM true-false answer sheets, which were 
then converted into punched card form by an IBM 1230 optical 
scoring machine. A computer program was written for a CDC 
1604 computer and used to transfer the punch-card data to magnetic 
tape. Subsequent processing was carried out on a CDC 6600 
computer using standard programs described elsewhere (Veldman, 1967).

Three 300-variable intercorrelation matrices were computed: 
males, females, and the total sample. The first twenty principal- 
axis factors were extracted (in order of their size) from the 
phi-coefficient matrix for the entire sample. Examination of the 
successive eigenvalues indicated that far more than 20 factors 
would have been extracted if the usual unity-eigenvalue rule had 
been followed. A varimax rotation of the first 10 principal axes 
was carried out and all loadings with absolute values less than 
0.40 were ignored. Only seven factors remained upon which more 
than two items had their highest loadings. The first seven
principal axes were then rotated by the varimax method and,
considering only loadings with absolute values greater than 0.40,
each item was assigned to the factor on which it loaded most heavily.
The content of each of these factors is described in Table 1.
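
The extraction, rotation, and item-assignment steps described above can be sketched as follows. This is a minimal illustration, not the authors' program (which is described in Veldman, 1967): it assumes unities in the diagonal of the correlation matrix, uses the raw (unnormalized) varimax criterion with an SVD-based iteration, and all function names and the toy loading matrix are hypothetical. For 0/1 item responses, the ordinary product-moment correlation of two items is their phi coefficient, so `numpy.corrcoef` applied to a persons-by-items matrix of 0s and 1s would supply the input matrix.

```python
import numpy as np

def principal_axes(corr, k):
    """Loadings on the k largest principal axes of a correlation matrix
    (unities in the diagonal: an assumption of this sketch)."""
    vals, vecs = np.linalg.eigh(corr)            # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]           # k largest, in order of size
    return vecs[:, order] * np.sqrt(np.clip(vals[order], 0.0, None))

def varimax(loadings, max_iter=500, tol=1e-10):
    """Orthogonal varimax rotation (raw criterion, SVD-based iteration)."""
    p, k = loadings.shape
    rot = np.eye(k)
    crit_old = 0.0
    for _ in range(max_iter):
        lam = loadings @ rot
        target = lam**3 - lam @ np.diag((lam**2).sum(axis=0)) / p
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rot = u @ vt                              # nearest orthogonal matrix
        crit = s.sum()
        if crit_old != 0.0 and crit / crit_old < 1.0 + tol:
            break
        crit_old = crit
    return loadings @ rot

def assign_items(rotated, cutoff=0.40):
    """Factor on which each item loads most heavily, or -1 when no
    absolute loading exceeds the cutoff."""
    best = np.abs(rotated).argmax(axis=1)
    peak = np.abs(rotated).max(axis=1)
    return np.where(peak > cutoff, best, -1)

# Toy example: a clean two-factor pattern, spun 30 degrees away from
# simple structure, then handed to varimax.
simple = np.array([[.8, 0.], [.7, 0.], [.9, 0.],
                   [0., .8], [0., .7], [0., .6], [.2, 0.]])
angle = np.deg2rad(30.0)
spin = np.array([[np.cos(angle), -np.sin(angle)],
                 [np.sin(angle),  np.cos(angle)]])
rotated = varimax(simple @ spin)
assigned = assign_items(rotated)
```

In the toy example the first three items fall on one factor, the next three on the other, and the last item, whose largest loading is below 0.40, is left unassigned.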

The next stage of the analysis concerned the comparison of 
factor structures obtained independently from the male and female
data. Intercorrelation of the 300 item-variables, extraction
of the first seven principal components, and varimax rotation 
were separately implemented with the male and female data. An 
analytic technique due to Kaiser (Veldman, 1967) was employed
to determine the degree to which the varimax structure for males
had to be re-rotated to achieve maximum test-vector contiguity
with the varimax structure for females.

The results of this initial comparison indicated that only five
of the seven factors were directly comparable in the separate
varimax structures, but that the structures could be brought into
close alignment by re-rotation. To demonstrate this, the male
and female structures were re-rotated by this same procedure
toward maximum test-vector overlap with the total-sample varimax
structure. Then the factor comparison procedure was applied to 
these re-rotated, separate-sex structures. Table 2 contains the 
results of this analysis; it is obvious that the factor structures for 
males and females can be brought into almost perfect agreement 
by the use of the total-sample structure as the target. 
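
The idea of re-rotating one structure toward a target can be sketched as an orthogonal Procrustes problem. This is a swapped-in technique chosen for illustration, not necessarily the Kaiser procedure cited above: it finds the orthogonal rotation that brings a loading matrix closest, in the least-squares sense, to a target matrix, and then summarizes agreement by cosines between factor columns. The function names and the toy matrices are hypothetical.

```python
import numpy as np

def rotate_to_target(loadings, target):
    """Rotate loadings by the orthogonal T minimizing ||loadings @ T - target||."""
    u, _, vt = np.linalg.svd(loadings.T @ target)
    return loadings @ (u @ vt)

def factor_cosines(a, b):
    """Cosines between every factor column of a and every factor column of b."""
    a_unit = a / np.linalg.norm(a, axis=0)
    b_unit = b / np.linalg.norm(b, axis=0)
    return a_unit.T @ b_unit

# Toy target structure and a copy of it in a different orientation.
target = np.array([[.8, .1], [.7, .0], [.1, .6], [.0, .9]])
angle = 0.7
spin = np.array([[np.cos(angle), -np.sin(angle)],
                 [np.sin(angle),  np.cos(angle)]])
misaligned = target @ spin

realigned = rotate_to_target(misaligned, target)
similarity = factor_cosines(realigned, target)
```

When two structures differ only by an orthogonal rotation, as in this toy case, the diagonal of the similarity matrix is 1.00 after re-rotation; with real male and female data the diagonal values would show how nearly the factors coincide.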

A by-product of the factor comparison procedure is a series of 
300 coefficients which reflect the degree to which the male and 
female item vectors occupy a common position in the common 
factor space. For all but a few items, these coefficients had values 
exceeding 0.90. However, the few items with low values are of
special interest in that they evidently have quite different semantic 
significance for males and females. These items are listed in Table 3.
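
One plausible form for such item-level coefficients is the cosine between an item's male and female loading vectors once both structures occupy the same factor space. The source does not give the formula, so the sketch below is an assumption, and the function name and loadings are hypothetical.

```python
import numpy as np

def item_similarity(male, female):
    """Cosine between each item's male and female loading vectors (rows)."""
    num = (male * female).sum(axis=1)
    den = np.linalg.norm(male, axis=1) * np.linalg.norm(female, axis=1)
    return num / den

# Hypothetical loadings for three items on two common factors.
male = np.array([[0.6, 0.0], [0.0, 0.5], [0.3, 0.4]])
female = np.array([[0.5, 0.0], [0.4, 0.0], [0.3, 0.4]])
coefficients = item_similarity(male, female)
```

An item whose vector points the same way in both groups receives a coefficient near 1, while an item that loads on different factors for males and females receives a coefficient near 0.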


Discussion 
The seven factors described in Table 1, based on the total
sample analysis, may be of value for further study of self-description
and self-concept organization. Their relationships to other
findings in this area will be examined below.


TABLE 2
Coefficients of Factorial Similarity for Re-rotated
Male and Female Factor Structures
Females 

E or n in aay Y vu â vw 

tt 1.00 .02 .00 Qo 0 -.0 .02 
X3 n —.02 1.00 —.01 .02 .01 l3  —.04 
CON -00 i0 —1.00 l1  -.01 ‘02 —.0 

y .00 | —:02.. —.01.,..1.00115—:08 .03 .00 

VI mu Next .00 ‘03 — 1.00 03  —.06 
p VII .0  —l3  —lo 3-0 10 .03 
poene cim 02 .04 .01 -00 006  —.0 1.00


TABLE 3


ACL Items with Dissimilar Locations for Males and
Females in a Common Factor Space


Similarity    Item
   .06        147. masculine
   .28         86. feminine
   .60         74. effeminate
   .70        194. rattlebrained
   .71        105. handsome
   .76        273. unaffected
   .78        189. queer
   .78        285. unrealistic


Factor 1, identified as Social Facilitation, is defined primarily
by 28 adjectives describing characteristics having a high degree of 
social favorability, and which seem to indicate maturity, respon- 
siveness, ease and awareness in regard to interpersonal dimensions. 
Three-quarters of these items occur both on the ACL Favorable 
items scale and Nurturance scale; 17 occur on the Affiliation 
scale; 12 are contraindicative adjectives from the Aggression scale.
This factor closely resembles Scarr’s (1966) factor 1 (Social 
Desirability) and Parker and Megargee's (1967) factor 1 (Positive 
vs. Negative Evaluation). 

Factor 2, labelled Interpersonal Abrasiveness, contains nine 
adjectives describing characteristics which would largely tend to 
irritate, annoy, or offend others, and which would generally be 
regarded as undesirable social attributes. Seven of the nine items 
are included in the ACL Unfavorable items scale. 

Factor 3, called Ego Organization, consists mainly of 20 adjectives
defining internal stability, dependability, and achievement-oriented
characteristics. Thirteen of these adjectives are in the
ACL Favorable items scale; 11 appear in both the Endurance
and Order scales; and seven occur in the Achievement scale. This
factor appears to be quite similar to Scarr’s factor 3 (Personality
Traits Associated with Intelligence) and may have specific rele- 
vance in college students for scholastic facilitation and accomplish- 
ment. 

Factor 4, labelled Introversion-Extraversion, and containing 
eight adjectives, was one of the easiest to name because the content 
seems uniformly clear. All eight of the adjectives occur also on
the ACL Exhibition scale, three being scored as indicative and
five as contraindicative of this characteristic. Factor 4 appears to
resemble closely Scarr’s factor 2 (Introversion-Extraversion) and
Parker and Megargee’s factor 2 (Ascendency vs. Obsequiousness).

Factor 5, with 14 primary adjectives, labelled Internal Discom- 
fort, basically defines intrapersonal distress coupled with difficulty 
with emotional control. Four of the items appear on each of several 
ACL scales: Unfavorable items, Lability, and Aggression. 

Factor 6, with three key adjectives, was labelled Intraception. 
This cluster was hardest to label, largely because the concepts 
do not seem to match any well-established psychological concept. 
In fact, “unconventional” is scored as an indicative adjective on 
the ACL Exhibition scale, while “reflective” is scored as con- 
traindicative for the same scale. Taken together with the facts 
that “idealistic” is scored on no existing ACL scale and that all 
three of these adjectives load in the same direction on Factor 6, 
a unitary concept is hard to imagine. If the loading-criterion 
of 0.40 is reduced somewhat, the significance of this factor becomes 
clearer. Adoption of a criterion of 0.30 adds the following six 
items: complicated (.37), impulsive (.35), individualistic (.35),
insightful (.36), spontaneous (.39), and unaffected (.32), none of
which has a higher loading on any other factor. Together with
the original three items—reflective, unconventional, and idealistic 
—the theme of the concepts begins to suggest an introspective 
aloofness or cognitive independence that is reminiscent of the 
current “hippie” stereotype. This factor represents an important 
dimension of self-concept—at least for a meaningful proportion 
of the population—which was not anticipated in the rational
derivation of the existing ACL scales. 

In view of the possibility that the data from this study may be 
used for the derivation of new “factor” scales, and that scale 
reliability and length are related, it would be appropriate to 
include the additional six adjectives in future work with a scale
based on this factor. In such a case, it may be equally appropriate
to relabel the concept “Cognitive Independence.”
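
The relation between scale length and reliability invoked here is commonly expressed by the Spearman-Brown formula; the source does not name it, so its use below is an assumption. The starting reliability is hypothetical and serves only to show the arithmetic of lengthening a three-adjective scale to nine adjectives.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a scale is lengthened by length_factor,
    assuming the added items behave like the existing ones."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

three_item = 0.45                       # hypothetical reliability of 3 adjectives
nine_item = spearman_brown(three_item, 9 / 3)
```

The prediction holds only to the extent that the six added adjectives measure the same thing as the original three, which the lower loadings of the added items make uncertain.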

Factor 7, called Social Attractiveness, contains 10 adjectives 
which describe social smoothness and flair associated with a
component of heterosexual effectiveness. Four of the adjectives
are also part of the ACL Heterosexual scale, and three appear on
both the Favorable items and Exhibition scales of the ACL.
This factor, together with factor 4, has some similarity to Parker
and Megargee’s factor 3 (Emotionality vs. Stolidity), particularly
with regard to the “emotionality” component.

It is worth noting, and is somewhat reassuring, that most of the 
300 ACL adjectives have a common “meaning” for both male 
and female Ss. This semantic commonality is demonstrated most 
clearly by the fact that only a small number of items (Table 3)
have markedly dissimilar locations in common male-female factor 
space, indicating that these concepts were used in different contexts 
by males and females in this sample. The conclusion seems warranted
that they do have different meanings which are sex-related.
It is noteworthy also that most of these items have sex-referents. 
This can be seen with “masculine” and “feminine,” the two ad- 
jectives showing the greatest dissimilarity in the ways they were 
used by our male and female Ss, and is also true of “effeminate” 
and “handsome.” The implication is that when a female describes 
herself as “masculine,” for example, she is saying something quite 
different about herself and not simply the opposite of what the
male means when he endorses this adjective as self-descriptive. 
The related finding that the concept “queer” is used in relatively 
different contexts by males and females may be accounted for by 
its referents “odd” vs. “homosexual,” with the latter meaning 
apparently most often implying male sexual inversion. In any event, 
the essential significance of the data in Table 3 may be to under- 
score the need for caution in the application of rationally-derived 
scales in the absence of empirical validational data. That it is 
a very difficult task to predict a priori sex differences in the 
meanings of assessment materials, such as ACL adjectives, may be 
seen in the finding that, while “unaffected” was used in dissimilar 
contexts by males and females in this study, “affected” (another 
ACL adjective) showed no corresponding dissimilarity.

In sum, the results of the item factor analyses of the ACL show 
that the factor structure is remarkably invariant between males 
and females. Commonality was found in such major behavioral 
dimensions as social maturity and desirability, interpersonal offensiveness
and abrasiveness, internal stability and ego organization,
intraindividual distress and internal discomfort, and social
introversion-extraversion. Also notable is the strong component of
social favorability vs. unfavorability within the general factor
structure.


CLASS CONCEPTS


When trait variables in a factor analysis are broadly sampled
from the universe of human behavior, in such a way that each
trait logically has little in common with any other in the analysis,
if each trait is sufficiently represented, even in an orthogonal solution,
we may expect to obtain reasonably clear simple structure.
Factor analyzing measures of factors in diverse personality modes,
e.g., carefulness, mechanical interest, word fluency, and anxiety, or
even different factors of the same mode, e.g., verbal comprehension,
perceptual speed, spatial visualization, and originality, is at
present little more than an exercise.

A really demanding task is to see whether such trait variables
that are narrowly sampled from the trait universe can be demonstrated
as independent constructs, whether or not they are embedded
within a theoretical network. This study focuses upon
intellectual abilities involving class concepts—a very narrow sample
of intellectual traits.

Guilford’s (1959) theoretical model, the structure of intellect
(SI), calls for 20 such abilities, one for each of five categories of
operation within each of four categories of content. Before this
study was well under way, 12 of the 20 abilities were believed to
have been demonstrated factor analytically. The study reported
here sought to extend the number of identified abilities in the classes
domain, while replicating eight of the 12 previously demonstrated 
classes factors. 


The Factors and Their Tests 


To facilitate discussions of the classes abilities here and later,
Table 1 is provided. It extracts from the total SI model the horizon-
tal layer containing the abilities pertaining to classes, and it con- 
tains 20 cells to represent the 20 classes abilities mentioned earlier. 
Each cell has its unique trigram symbol for its particular conjunc- 
tion of operation, content, and product, mentioned in that order. 
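
The construction of the trigram symbols can be sketched in a few lines. The letter codes are those appearing in Table 1; the dictionary layout and the function name are hypothetical conveniences.

```python
# Letter codes taken from Table 1: operation, then content, then product.
OPERATIONS = {
    "C": "Cognition",
    "M": "Memory",
    "D": "Divergent Production",
    "N": "Convergent Production",
    "E": "Evaluation",
}
CONTENTS = {
    "F": "Figural",
    "S": "Symbolic",
    "M": "Semantic",
    "B": "Behavioral",
}
PRODUCT = "C"  # classes: the only product considered in this layer

def class_trigrams():
    """Map each (operation, content) pair to its trigram symbol."""
    return {
        (op_name, content_name): op + content + PRODUCT
        for op, op_name in OPERATIONS.items()
        for content, content_name in CONTENTS.items()
    }

trigrams = class_trigrams()
```

Five operations crossed with four contents give the 20 cells of the classes layer; for example, `trigrams[("Memory", "Symbolic")]` is the symbol MSC.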

The nature of the factors to be expected is inferred from their 
places in the SI model. The properties of each ability are specified 
by its conjunction of values on the three parameters—operation, 
content, and product. For parallel factors there should be parallel 
tests. Such parallels will be pointed out as the pertinent features of 
the tests are described. 


Tests for Cognition of Classes 


Cognition means awareness or comprehension of information. 
Test items that indicate whether or not examinees have possession 
of certain class ideas or class concepts are sufficient to tell us about 
their characteristic levels on scales of cognition-of-classes abilities. 
Certain types of tests have been used and found to be discrimina- 
tive among individuals along the cognition-of-classes dimensions. 
Not all types have been used with all kinds of information (con- 
tent). 

Commonly used for cognition have been tests of the “exclusion” 


TABLE 1


A Matrix of Abilities Pertaining to Classes,
with Variations of Content and Operation


                         Figural   Symbolic   Semantic   Behavioral

Cognition                  CFC       CSC        CMC        CBC
Memory                     MFC       MSC        MMC        MBC
Divergent Production       DFC       DSC        DMC        DBC
Convergent Production      NFC       NSC        NMC        NBC
Evaluation                 EFC       ESC        EMC        EBC


type in which four or five potential exemplars of a class concept
are given in each item, one of which does not belong to the class 
and the subject (S) is to identify it. Three such tests were used 
in this investigation, although they were known not to be among 
the most successful tests in previous analyses. They had correlated 
lower with other tests of their respective factors than is usual. The 
reason was revealed by the results in this analysis. The three ex- 
clusion tests were Figure Exclusion, Letter-Group Exclusion, and 
Word Grouping, for factors CFC, CSC, and CMC, respectively. 

Another variety of cognition-of-classes test may be called an 
“inclusion” test, for it asks what single unit of information be- 
longs in a class that is to be identified by S from a set of two or three 
exemplars. The single unit is to be selected from five alternatives. 
The inclusion tests used in the present analysis were Figure Class 
Inclusion and Letter Classification for the factors of CFC and CSC,
respectively.

A third kind of test in this area is in matching format. Four sets 
of three exemplars each, each set forming a class the concept of 
which is to be cognized, are presented along with five alternate 
potential exemplars. Figure Classification, Number Classification, 
and Verbal Classification are in this category, for factors CFC, CSC,
and CMC, respectively. Verbal Classification differs from the others
by having only two classes, each represented by four exemplars,
with eight words to be put in one class or the other, or in
neither. The restriction to two classes and the addition of the
“neither” alternative might be expected to involve some complications,
and we shall see that this is so.

A fourth kind of test was used in connection with factor CSC
only. Number-Group Naming presents a set of three numbers in
each item, with S to name the concept or otherwise to verbalize it.
Such tests in the other content categories have been found loaded
on factor NMU, originally called a “naming” factor but later recognized
as the convergent production of semantic units (NMU),
in the SI model. Evidently the step of naming the class concepts
in such tests is ordinarily more difficult and is a heavier
contributor to variances in scores than is the step of cognizing the
class concept, hence the significant loading on NMU. Apparently
in the case of Number-Group Naming the reverse is true, for it
has a history of loading on the CSC factor rather than on NMU.


Tests for the Memory of Classes


The four tests for the memory-for-classes abilities, two for each
factor, were in the category of marker tests. Two principles are 
represented in these four tests. In three of them, S studies sets of 
three exemplars each on the study page, in each of which some 
particular class concept should be readily cognized, so easily that 
there should be no cognition variance in the test scores. The re- 
tention test that follows immediately, offers for recognition not the 
same exemplars but the same class concept represented by new ex- 
emplars. The replicated classes are of course mixed with negative 
instances of similar classes. 

Memory for Nonsense Word Classes presents for study sets of 
trigrams, such as GID, VID, JID, with the correct set for recogni- 
tion such as ZID, FID, NID. Memory for Word Classes is much 
the same, using familiar words instead of meaningless trigrams, the 
class concepts being dependent upon spelling features, hence both 
were markers for factor MSC. Classified Information is in similar 
format, but the classes depend upon common meanings of the 
words, hence it represents factor MMC. A set that was studied 
might contain: SILK WOOL NYLON, and the set to be recognized 
might be: RAYON COTTON FELT. 

The other marker test for MMC was Picture Class Memory, 
in which the studied sets are made up of three pictured familiar 
objects, such as articles of clothing used for keeping warm in cold 
weather. The set given for recognition contains pictures of such 
clothing with two exemplars. One of these exemplars is identical 
with one in the studied set, given along with a new one that rep- 
resents the class well. A mislead alternative presents another ex- 
emplar from the same studied class, but it is paired with an ex- 
emplar representing some other class. 


Tests for Divergent Production of Classes 


Divergent production rests upon the recall of information from 
memory storage to satisfy certain needs raised by test items. The 
test items are “open” in the sense that many different responses are
relevant and more or less appropriate. The scoring of such tests 
emphasizes quantity of production and variety. Three general prin- 
ciples are represented in the nine tests used for factors DFC, DSC, 
and DMC.

The production of classes means the construction of groups, putting
exemplars appropriately into those groups. This holds true for
either divergent or convergent production. The difference between
the two operation categories is that divergent classification involves
multiple ways of grouping items of information, whereas convergent
classification is hedged in with sufficient restrictions so that
only one class will do. In divergent classification, the same item of
information appears in more than one class; in convergent
classification, an item of information is an exemplar of only one class
concept, unless the rules specify other conditions, still restrictive.

A regrouping activity of some kind is a natural one for a test of
divergent production of classes, a principle that is applied in two
ways. One way is to present a limited list of units, each of which
has several attributes, each of which is held in common by some
other unit in the list. For example, Alternate Letter Groups gives
eight capital letters from the alphabet, with S to group and regroup
them in as many ways as he can in terms of their appearances.
Although letters are used, it is the figural properties that provide
the basis for classification, hence this test is for DFC, not DSC.
Another DFC test, Multiple Grouping of Figures, provides nine
figures each containing some geometric attributes.

Two tests for factor DSC also employ the same regrouping
principle. Multiple Grouping of Nonsense Words provides a list of
10 letter groups. Name Grouping presents lists, each of nine given
names, to be grouped and regrouped in terms of certain letter or
letter-combination properties. One semantic test for DMC, Multiple
Grouping, follows the regrouping principle by giving a list of
well-known words to be regrouped in terms of their meanings.

A second regrouping type of test has the variation of giving one
set of three units that has a number of common attributes making
it a candidate for grouping with different units selected from a list.
Multiple Figural Similarities presents a set of three figures and a
list of ten other figures, with S to find a number of such units that
can be classed with the set, each for a different reason. A parallel
test for factor DSC, Multiple Letter Similarities, gives a set of three
letter groups and a list of other letter groups, with S to select one
group at a turn to classify with the given letter-group set.

For factor DMC, a quite different principle has been applied.
The “shift” score from the Brick Uses Test is a count of the number
of times S changes category of uses, e.g., going from a brick
as building material to uses as a missile, a weight, a marker, and so
on. It is recognized that what S essentially is doing in order to earn
a good score is to reclassify a brick many times. A multiple-group- 
ing test, which more obviously satisfies the SI specifications for 
DMC, helps to define the same factor. In this study, two tests for 
DMC are based upon the shift principle. Utility Test includes the 
activities of listing uses for a brick and for a common wooden lead 
pencil. Alternate Uses asks for listing of unusual uses of a number 
of familiar objects, the ordinary use being excluded. This condition 
almost automatically entails changes of category with every re- 
sponse. Every use is likely to reflect a different attribute of the 
object. 
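
The shift score described above amounts to counting category changes between consecutive responses. The following is a minimal sketch, assuming that a scorer has already coded each listed use into a category; the function name `shift_score` and the category labels are hypothetical and are not taken from the test materials.

```python
def shift_score(categories):
    """Count how often consecutive responses fall in different categories.

    `categories` is the sequence of category labels assigned by a scorer
    to the uses S listed, in the order S wrote them (hypothetical coding).
    """
    shifts = 0
    for previous, current in zip(categories, categories[1:]):
        if current != previous:
            shifts += 1
    return shifts


# Hypothetical coded responses for "uses of a brick".
responses = ["building", "building", "missile", "weight", "weight", "marker"]
print(shift_score(responses))  # 3 changes of category
```

Under this scoring, a subject who stays within one category earns no shift credit however many uses are listed, which is why the score reflects reclassification rather than sheer quantity.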


Tests of Convergent Production of Classes 


As stated earlier, in the convergent production of classes,
restrictions ordinarily preclude more than one right answer. The
tests can be similar to those for divergent production of classes 
with some added features of restriction. 

The simplest principle is that of presenting a list of n units that 
are to be classified into mutually exclusive groups. A test following 
this principle calls for the act of partitioning a collection of items 
of information. The test, Word Grouping, presented a list of 12 
familiar words, with S to form four classes, no word to be in more 
than one class and every word being classified. Completely parallel 
tests were developed in the form of Figure Grouping for NFC
and Letter Grouping for NSC, the units in the latter instance 
being trigrams. 

A modification of the partitioning type of test just described 
presents a list of nine units—nine figures, letter groups, or words— 
to be grouped by threes, so that a given single “target” unit, or
model, can belong to each of the three classes in turn. For example, 
a lone square can be grouped with three quadrilaterals, with three 
figures containing parallel lines, and with three figures containing 
right angles, as in the test Figure-Concept Grouping. Letter-Concept 
Grouping and Concept Grouping are similar tests for NSC and 
NMC, respectively. 

Another modification of the partitioning type is seen in Group
Classification, for NMC. Each potential exemplar is composed of a
set of four words, one of which is the “attribute” common to 
words of similar meaning in other sets of four. Eight such sets are 
given, with two additional target sets to serve as models for the two 
classes that are to be formed from the eight. 

Another controlled-grouping test permitted using each unit two 
times (and only two). Given a list of six units of one kind of 
content, S is to form two sets of two classes each. This is true of 
Restricted Figural Classification, for NFC, and for Restricted Sym- 
bolic Classification, for NSC. No corresponding test form was 
used for NMC. 

One test of the partitioning type presents nine words, with S 
to separate them into two classes so that one of the classes shall 
be as large as possible. Largest Class, for NMC, is of this type. 

Finally, as an attempt to design a quite different type of test and
with some desire to test the degree of generality of factor NFC,
the test Figural Hierarchical Grouping was designed. Given a list
of either seven or 15 complex figures in an item of this test, S
is to find the most general case, two major classes under it, and two
minor classes under each of the major classes, when seven figures
are presented, and two sub-classes under each of the minor classes,
in addition, when 15 figures are given. Such a complex test might
well be expected to load on more than one factor, and the results
did show a substantial loading on another factor.


Tests for the Reference Factors


Tests of four reference factors outside the category of classes 
abilities were included in the analysis, with the expectation that 
they would possibly contribute significantly to variances in scores 
from some of the classes tests. A completely adequate set of ref- 
erence factors was impossible to include because of the limitation 
of available testing time. The reference factors and their marker 
tests are listed below. 

CMU—Cognition of semantic units. Verbal Comprehension, a
multiple-choice vocabulary test, and Word Completion, a defining
type of vocabulary test, were employed as measures of CMU.

Footnote: For more complete descriptions of tests, including sample
items, see Dunham, Guilford, and Hoepfner (1966).

CMS—Cognition of semantic systems. Problem Solving, an
arithmetical-reasoning test, and Ship Destination, a test with problems of
increasing complexity, were selected to represent the
general-reasoning factor or CMS.

DSU—Divergent production of symbolic units. The word-fluency
factor was to be measured by Suffixes, a test calling for the
production of many words with particular endings, and Word
Fluency, calling for the production of many words with specified
letters.

NMU—Convergent production of semantic units. Naming Meaningful
Trends, a test of naming a semi-ordered trend of objects,
Picture-Group Naming, a test of labeling concepts defined by 
groups of pictures, and Word-Group Naming, a test of labeling 
concepts defined by groups of words, were employed to measure 
the naming factor, NMU. 

It will be noted that the last two titles suggest that they refer to 
classes tests. That had been the original intention in their construc- 
tion, but experience had shown that they are strongly related to 
factor NMU instead. In this study there was another, and better, 
opportunity to see whether they have any classes variance. No 
tests had been especially designed for NMU when this investiga- 
tion was conducted. Such tests will probably involve neither rela- 
tions, as in trends tests, nor classes. 


Procedure 


Test Development 


Sixteen new tests were constructed, employing the SI model in 
two ways. First, specific examples of tasks were deduced from 
the operation-content-product combinations. Second, tasks were 
devised by analogy to those that had proved to be successful for 
parallel SI abilities, those having one or two categorical attributes
in common. Thus, a test for factor NSC could be written similar
to a test for NMC, the only difference being the kind of content,
or similar to one for DSC, the only difference being the kind of
operation. Many examples of such parallels can be seen in the
discussion of classes tests earlier. Twenty-seven tests were selected
from the list of recommended tests for identified SI factors
provided by Guilford and Hoepfner (1963). New items were written
for three of these tests (Figure Classification, Figure Exclusion,
and Word Classification), with attempts to increase reliability and
univocality. The names of three tests were changed in order to
make their labels more indicative of the abilities they measure.
Sentence Evaluation was changed to Sentence Classification; Letter
Grouping to Letter-Group Exclusion; and Seeing Trends I to
Naming Meaningful Trends.

The new and revised tests were pretested to determine clarity of 
instructions, appropriate difficulty levels, test reliabilities, and op- 
timal time requirements. In this process, the tests were administered 
to samples of junior-college and university students. 


Subjects 


The subjects for factor analysis were 177 male and female, ju- 
nior and senior students, at a high school in a middle-class urban 
area in Southern California. 


Administration and Scoring of Tests 


Administration of tests took place in a large auditorium. The 
total time required was nine hours, divided into three sessions of 
three hours each, on different days. The juniors were tested 
during one week and the seniors the following week. The tests
were presented in nine printed booklets, with the restriction that
tests for the same factor should not appear in the same booklet.

All tests were scored by hand and independently checkscored 
by a different scorer. Scoring formulas were applied to multiple- 
choice tests to correct for guessing. 
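
The text does not say which scoring formulas were applied. As an illustration only, the sketch below uses the conventional correction for guessing, rights minus wrongs divided by one less than the number of alternatives; treating this as the formula used in the study is an assumption, and the function name `corrected_score` is hypothetical.

```python
def corrected_score(rights, wrongs, n_alternatives):
    """Conventional correction for guessing: R - W / (k - 1).

    Omitted items are neither rewarded nor penalized. Whether the study
    used exactly this formula is an assumption made for illustration.
    """
    if n_alternatives < 2:
        raise ValueError("a multiple-choice item needs at least two alternatives")
    return rights - wrongs / (n_alternatives - 1)


# Hypothetical paper: 30 right, 8 wrong, on five-alternative items.
print(corrected_score(30, 8, 5))  # 28.0
```

With five alternatives, four wrong answers cancel one right answer, which is the expected yield of blind guessing on five items.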


Analysis 


Table 2 presents the statistical information concerning the test
scores, including means, standard deviations, and estimates of
reliabilities. The score distributions of four tests were too severely
skewed or truncated to meet the requirement for computing a
Pearson r, so they were dichotomized near their medians. Those tests
were Concept Grouping, Memory for Nonsense Word Classes,
Letter Grouping, and Restricted Figural Classifications. For all
tests with two or more separately timed parts, Spearman-Brown
adjustments of inter-part correlations were used as estimates of
reliability. Kuder-Richardson estimates were computed for all
one-part tests that showed no material evidence of speeding. For
one-part, speeded tests, communalities are given as lower-bound
estimates of reliability.


TABLE 2
Means, Standard Deviations, and Reliabilities of Test Scores

Test Name                                  Code      Mean   Standard   Reliability
                                                            Deviation
 1. Alternate Letter Groups                DFC03B   17.12     3.76      [?]
 2. Alternate Uses                         DMC03C   12.92     5.37      .81
 3. Classified Information                 MMC01A   42.50    14.27      .18
 4. Concept Grouping                       NMC02A   15.46     3.35      [?]
 5. Figural Class Inclusion                CFC04A   11.29     5.17      .09
 6. Figural Hierarchical Grouping          NFC02A    6.93     5.07      [?]
 7. Figure Classification                  CFC0?A    9.95     4.03      .61
 8. Figure-Concept Grouping                NFC03A   12.23     3.79      [?]
 9. Figure Exclusion                       CFC03A   15.10     4.30      .46
10. Figure Grouping                        NFC01A   52.72    13.61      .80
11. Group Classification                   NMC03A   16.27     7.29      [?]
12. Largest Class                          NMC04A    3.07     1.76      .54
13. Letter Classification                  CSC06A   11.66     4.73      [?]
14. Letter-Concept Grouping                NSC02A    8.62     4.77      [?]
15. Letter-Group Exclusion                 CSC01B   21.39     5.64      .66
16. Letter Grouping                        NSC01A   58.02    12.11      [?]
17. Memory for Nonsense Word Classes       MSC02B    5.67     3.09      .82a
18. Memory for Word Classes                MSC04A   22.12    10.45      [?]
19. Multiple Figural Similarities          DFC07A   10.64     2.49      [?]
20. Multiple Grouping                      DMC02C    7.53     1.67      [?]
21. Multiple Grouping of Figures           DFC0?A   13.98     3.18      [?]
22. Multiple Grouping of Nonsense Words    DSC05A    5.01     2.58      .51
23. Multiple Letter Similarities           DSC04A   10.72     3.36      .66
24. Name Grouping                          DSC02B    8.02     2.60      [?]
25. Naming Meaningful Trends               NMU04A    3.96     2.29      [?]
26. Number Classification                  CSC0?C    9.88     4.80      [?]
27. Number-Group Naming                    CSC05B    7.44     2.41      [?]a
28. Picture Class Memory                   MMC03B   14.98     4.90      [?]
29. Picture-Group Naming                   NMU03A    4.56     1.78      [?]a
30. Problem Solving                        CMS05A    4.22     3.31      [?]
31. Restricted Figural Classifications     NFC0?A   15.15     8.98      [?]
32. Restricted Symbolic Classifications    NSC04A   11.58     9.57      [?]
33. Sentence Classification                CMC03A   17.53     6.49      [?]a
34. Ship Destination Test                  CMS02D    9.38     6.69      .63b
35. Suffixes                               DSU01A   12.44     4.51      [?]
36. Utility Test                           DMC01A   14.99     5.97      [?]
37. Verbal Classification                  CMC02B   43.47    15.54      [?]
38. Verbal Comprehension                   CMU02?   10.44     4.31      [?]
39. Word Classification                    CMC01B   12.82     3.33      [?]
40. Word Completion                        CMU01B   10.14     3.33      [?]
41. Word Fluency                           DSU02A   42.27     9.94      [?]
42. Word-Group Naming                      NMU02A   11.36     2.56      [?]
43. Word Grouping                          NMC01B   41.59     5.20      [?]
44. Sex                                               .45      .50

a Kuder-Richardson estimate of reliability.
b Obtained communality as a lower-bound estimate of reliability.
[?] Entry (or character) not legible in the available copy.
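
The Spearman-Brown adjustment mentioned above steps the correlation between separately timed parts up to the reliability of the whole test. A minimal sketch of the standard formula follows; the function name `spearman_brown` and the example correlation are hypothetical and are not values from Table 2.

```python
def spearman_brown(r_parts, n_parts=2):
    """Estimate full-test reliability from the correlation between parts.

    Standard Spearman-Brown formula: n*r / (1 + (n - 1)*r), where r is the
    correlation between comparable parts and n is the number of parts.
    """
    return n_parts * r_parts / (1 + (n_parts - 1) * r_parts)


# Hypothetical two-part test whose parts correlate .60.
print(round(spearman_brown(0.60), 2))  # 0.75
```

The formula assumes the parts are equivalent in content and length, which is the usual justification for applying it to separately timed halves of one test.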

A matrix of intercorrelations of the 43 tests and the variable of 
Sex was computed. For variables that had been dichotomized, 
point-biserial and phi coefficients came from the computer. Cor- 
responding Pearson r’s were estimated by applying appropriate 
formulas (Guilford, 1965, pp. 324-354).
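
The text does not name the conversion formulas. For a dichotomized variable paired with a continuous one, a standard conversion estimates the biserial r from the point-biserial r; the sketch below shows that case only, assumes a normal distribution underlying the dichotomy, and should not be read as the exact procedure used in the study. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def biserial_from_point_biserial(r_pb, p):
    """Estimate a biserial r from a point-biserial r.

    p is the proportion of cases in the upper category of the dichotomy.
    Uses r_bis = r_pb * sqrt(p * q) / y, where y is the ordinate of the
    unit normal curve at the point dividing the area into p and q.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    q = 1.0 - p
    z = NormalDist().inv_cdf(p)   # cut point on the normal curve
    y = NormalDist().pdf(z)       # ordinate at the cut point
    return r_pb * sqrt(p * q) / y


# Hypothetical case: a test split at its median (p = .50), r_pb = .40.
print(round(biserial_from_point_biserial(0.40, 0.50), 3))  # 0.501
```

A median split gives the smallest inflation factor (about 1.25); splits farther from the median inflate the point-biserial more, which is one reason for dichotomizing near the median.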

For the extraction of factors, estimates were made for the com- 
munalities of the 44 variables, using the multiple-R squared, then 
iterative extractions were made to obtain better estimates, assum- 
ing 16 factors (the number hypothesized). Iterations were con- 
tinued until no communality changed more than .05 in going from 
one cycle to the next. 
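
The extraction procedure just described can be sketched as an iterated principal-factor solution. The code below is an illustration under stated assumptions: it uses NumPy, starts from squared multiple correlations, refactors the reduced correlation matrix with a fixed number of factors, and stops when no communality changes by more than .05. The function name and the small demonstration matrix are hypothetical; the study's own program is not described in the text.

```python
import numpy as np


def iterate_communalities(R, n_factors, tol=0.05, max_iter=100):
    """Iterated principal-factor extraction with a fixed number of factors."""
    R = np.asarray(R, dtype=float)
    # Starting estimates: squared multiple correlation of each variable
    # with all of the others, 1 - 1 / diag(R^-1).
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
    for _ in range(max_iter):
        reduced = R.copy()
        np.fill_diagonal(reduced, h2)            # communalities in the diagonal
        eigvals, eigvecs = np.linalg.eigh(reduced)
        top = np.argsort(eigvals)[::-1][:n_factors]
        roots = np.clip(eigvals[top], 0.0, None)  # guard against negative roots
        loadings = eigvecs[:, top] * np.sqrt(roots)
        new_h2 = np.sum(loadings ** 2, axis=1)
        converged = np.max(np.abs(new_h2 - h2)) <= tol
        h2 = new_h2
        if converged:
            break
    return loadings, h2


# Hypothetical six-variable, two-factor example (not data from the study).
true_loadings = np.array([[.8, 0], [.7, 0], [.6, 0],
                          [0, .8], [0, .7], [0, .6]])
R_demo = true_loadings @ true_loadings.T
np.fill_diagonal(R_demo, 1.0)
loadings_demo, h2_demo = iterate_communalities(R_demo, n_factors=2)
print(np.round(h2_demo, 2))
```

A stopping rule of .05 is loose by later standards, so the estimates stop somewhat short of the values a tighter criterion would reach.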

The axes were rotated orthogonally to psychological interpret- 
ability by means of an analytical procedure developed by Cliff 
(1966). This procedure provides a least-squares fit of a matrix to a 
specified target matrix. The first target matrix was constructed by
giving each variable a loading equal to the square root of its
communality for the factor on which it was expected to appear after
rotation, with other loadings being set equal to zero. The resulting
rotated matrix indicated that some variables would not meet
satisfactorily their targeted major loadings. The target values for
further rotations were changed accordingly in the next target
matrix. In this manner a sequence of rotations was carried out, in
which the criteria of positive manifold and simple structure were
also applied.
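
A least-squares orthogonal fit to a target has a well-known closed-form solution by singular value decomposition (the orthogonal Procrustes solution). The sketch below builds a first target as described above and rotates an unrotated matrix toward it. It uses NumPy; the function names and the small example are hypothetical, and whether the computation matches Cliff's program in every detail is an assumption.

```python
import numpy as np


def build_target(communalities, factor_index, n_factors):
    """First target: sqrt(h2) on the hypothesized factor, zeros elsewhere."""
    target = np.zeros((len(communalities), n_factors))
    for row, (h2, col) in enumerate(zip(communalities, factor_index)):
        target[row, col] = np.sqrt(h2)
    return target


def rotate_to_target(unrotated, target):
    """Orthogonal least-squares rotation of `unrotated` toward `target`.

    Finds the orthogonal T minimizing ||unrotated @ T - target|| from the
    SVD of unrotated' target, then returns the rotated matrix and T.
    """
    u, _, vt = np.linalg.svd(unrotated.T @ target)
    transform = u @ vt
    return unrotated @ transform, transform


# Hypothetical example: four variables, two hypothesized factors.
h2 = [0.64, 0.49, 0.36, 0.25]
hypothesized = [0, 0, 1, 1]               # column index of the expected factor
target = build_target(h2, hypothesized, n_factors=2)

angle = 0.7                               # arbitrary rotation hiding the structure
Q = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])
unrotated = target @ Q.T                  # stands in for an extracted matrix
rotated, T = rotate_to_target(unrotated, target)
print(np.round(rotated, 2))
```

In the study the target was then revised wherever variables failed to reach their targeted loadings, and the fit was repeated; that revision is a matter of judgment and is not part of this sketch.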


Results 


The rotated factor matrix appears in Table 3. The interpretation
of each of the 16 rotated factors is based upon the apparent factor
content of the tests loaded significantly (.30 or higher) upon the
factor. The test loadings for the factor in question are listed, along
with any additional significant loadings of the tests, where tests
proved to be factorially complex. Each test name is preceded by
its number in the battery and is followed by the trigram for its
hypothesized factor.

Footnote: Correlation and principal component matrices are presented
elsewhere (Dunham, Guilford, and Hoepfner, 1966).


TABLE 3
Rotated Factor Matrix

[The entries of Table 3 are not legible in the available copy. Rows are
the 43 tests and Sex, numbered as in Table 2; columns are CFC, CSC, CMU,
CMC, CMS, MSC, MMC, DFC, DSU, DSC, DMC, NFC, NSC, NMU, NMC, SEX, and h².]

Note. Decimal points omitted.
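
The listing convention described above (loadings of .30 or higher, ordered by size, with any other significant loadings in parentheses) can be sketched as a small routine. The function name `salient_loadings` is hypothetical; the .47, .44, .38, and .32 values are loadings reported in the factor listings of this report, while the two small values are placeholders invented for the illustration.

```python
def salient_loadings(loadings, factor, threshold=0.30):
    """List tests loading at or above `threshold` on `factor`, largest first.

    `loadings` maps test name -> {factor code: loading}. Each result row is
    (test, loading on `factor`, other loadings that also reach the threshold).
    """
    rows = []
    for test, row in loadings.items():
        primary = row.get(factor, 0.0)
        if primary >= threshold:
            others = {f: v for f, v in row.items()
                      if f != factor and v >= threshold}
            rows.append((test, primary, others))
    rows.sort(key=lambda item: item[1], reverse=True)
    return rows


demo = {
    "Figure Classification": {"CFC": 0.47, "NFC": 0.10},  # .10 is a placeholder
    "Figure Exclusion": {"CFC": 0.44, "NFC": 0.38},
    "Figure Grouping": {"CFC": 0.12, "NFC": 0.32},        # .12 is a placeholder
}
for test, loading, others in salient_loadings(demo, "CFC"):
    print(test, loading, others)
```

A test with a non-empty third element is what the text calls factorially complex; a test with exactly one loading at or above the threshold is univocal.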


The Classes Factors 
CFC—Cognition of figural classes 
7. Figure Classification (CFC) .47
5. Figural Class Inclusion (CFC) .46
9. Figure Exclusion (CFC) .44 (.38 NFC)
6. Figural Hierarchical Grouping (NFC) .42 (.41 NFC)
19. Multiple Figural Similarities (DFC) .35 (.35 DFC)


The three leading tests on this factor had been hypothesized to
define CFC. The third test, Figure Exclusion, however, is complex,
having a loading of .38 on NFC. New consideration of this test
suggests that its items are somewhat like partitioning tests for
NFC, in which a list of items of information is to be segregated
into mutually exclusive classes. In the exclusion type of test, S
is actually to form two classes, one of them containing only one
exemplar and the other containing three or four, as the case may be.
From this point of view, the convergent-production variance in 
this and other exclusion tests is reasonable. 

In the development of classes tests in the categories of divergent 
and convergent production, efforts were made to control cognition
variance by utilizing common properties that are readily
recognized. From the fact that Figural Hierarchical Grouping and
Multiple Figural Similarities shared their variances with factor
CFC, we see that those efforts were not entirely successful, for
both of them share some cognition variance. 


CSC—Cognition of symbolic classes 
26. Number Classification (CSC) .53
27. Number-Group Naming (CSC) .53
13. Letter Classification (CSC) [loading illegible]
22. Multiple Grouping of Nonsense Words (DSC) .30 (.51 DSC)


Three of the four tests designed to measure CSC were loaded 
univocally on this factor. The fourth test, Letter-Group Exclusion, 
was loaded univocally on NSC instead. In this instance an exclusion
type of test went entirely on the convergent-production factor
that corresponds to CSC, the factor for which it was intended. The presence of
the DSC test, Multiple Grouping of Nonsense Words, on factor 
CSC is another example of how a test designed for a production 
ability has some cognition variance creeping in, but in this case, 
the loading on the corresponding cognition factor is minimally 
significant. 


CMC—Cognition of semantic classes 


33. Sentence Classification (CMC) .44
11. Group Classification (NMC) .41 (.43 NMC)
39. Word Classification (CMC) .38
37. Verbal Classification (CMC) .35 (.36 NMC)
3. Classified Information (MMC) .33 (.40 MMC)


The three tests designed to measure CMC were loaded signifi- 
cantly on this factor. Again we see a test designed to measure con- 
vergent production, Group Classification, with a substantial loading 
on a parallel cognition ability.

In the memory test, Classified Information, S is presented with
several sets of three words each on the study page, the words of a set
sharing a common property. On the test page he is to recognize a
new set of three words that have the same class property. Since
S has to recognize the common attributes on both study page
and test page, there are numerous opportunities for cognition
variance to enter into the scores on this test. Some classes were
evidently not obvious to all Ss.

Sentence Classification asks S to say whether each sentence is an
example of a statement of fact, of possibility, or is a matter of
naming. With this description, it is apparent that the task is that
of placing presented ideas somewhere in three defined classes. This
kind of task turned out to be univocal for CMC in this test battery.
There is still a possibility that Sentence Classification has some
relation to the parallel evaluation factor, EMC, for which there
were no marker tests in this battery.


MSC—Memory for symbolic classes
17. Memory for Nonsense Word Classes (MSC) .82
18. Memory for Word Classes (MSC) .63

The two marker tests for MSC performed even better than was
expected. Because of an unusual degree of similarity between these
two tests, it may be that their loadings are somewhat inflated with
a specific source of variance unique to the two.


MMC—Memory for semantic classes 


28. Picture Class Memory (MMC) .40
3. Classified Information (MMC) .40 (.33 CMC)


The two marker tests served their purpose in distinguishing this 
factor, but one of the tests showed some parallel cognition variance, 
as pointed out earlier. 


DFC—Divergent production of figural classes 


1. Alternate Letter Groups (DFC) .57
21. Multiple Grouping of Figures (DFC) .40
19. Multiple Figural Similarities (DFC) .35 (.35 CFC)


The three tests hypothesized to measure DFC were loaded on 
this factor. Multiple Figural Similarities, however, had a second 
loading of equal strength on CFC, indicating that the cognition 
aspect was not well controlled. Only tests designed for DFC ap- 
pear significantly loaded on this factor. 


DSC—Divergent production of symbolic classes 
22. Multiple Grouping of Nonsense Words (DSC) .51 (.30 CSC)
24. Name Grouping (DSC) .39
23. Multiple Letter Similarities (DSC) .37


Three tests designed for DSC were found loaded significantly 
on it, with no tests designed for other factors. Multiple Grouping 
of Nonsense Words, however, had a minimally significant loading 
on CSC, the cognitive parallel. 


DMC—Divergent production of semantic classes


36. Utility Test (DMC) .73
20. Multiple Grouping (DMC) .57
2. Alternate Uses (DMC) .46


This factor has three univocal tests loaded on it, as expected,
with Utility Test, based upon the shift-score principle, clearly
leading the three.


NFC—Convergent production of figural classes
8. Figure-Concept Grouping (NFC) .50
6. Figural Hierarchical Grouping (NFC) .41 (.42 CFC)
9. Figure Exclusion (CFC) .38 (.44 CFC)
10. Figure Grouping (NFC) .32


Three of the four tests designed for NFC performed primarily 
as expected, two of them being univocal. As pointed out earlier, 
Figural Hierarchical Grouping was not free from figural-cognition 
variance. The figures tended to be complex, with more attributes 
than usual in order to determine the necessary hierarchical classifi- 
cation in each problem. This condition evidently generated some 
cognitive problems. 

The test that failed, Restricted Figural Classification, differed
from the rest, in that it called for two different partitionings of a
set of six exemplars. Calling for more than one pair of groups might
lead one to expect some DFC variance, but the test was not loaded
significantly on this factor or on any other in this analysis.
There is no apparent hypothesis to suggest what factor outside
this analysis might be strongly represented.


NSC—Convergent production of symbolic classes 


16. Letter Grouping (NSC) .61
32. Restricted Symbolic Classification (NSC) .47
15. Letter-Group Exclusion (CSC) .41
14. Letter-Concept Grouping (NSC) .30


Three tests designed to measure NSC had significant loadings on
that factor, plus the exclusion type of test designed for CSC, which
failed to have much cognitive variance. The tests designed for NSC
proved to be univocal for it, but one had only a marginally
significant loading.


NMC—Convergent production of semantic classes 


11. Group Classification (NMC) .43 (.41 CMC)
43. Word Grouping (NMC) .42
4. Concept Grouping (NMC) .37 (.49 CMU)
37. Verbal Classification (CMC) .36 (.35 CMC)
12. Largest Class (NMC) .34

Two of the four tests for NMC were loaded univocally on
NMC: Word Grouping and Largest Class. The resemblance
between Largest Class and the exclusion types should be clear, the
major difference being that the smaller class in an item in Largest 
Class is likely to have more than one exemplar left in it. Largest 
Class does not share any significant variance on the corresponding 
cognition factor, however, as some of the exclusion tests do. 

Two other tests designed for NMC have significant loadings on
it, but also substantial second loadings. Group Classification shares
an equal amount of variance with CMC, undoubtedly because there
is some difficulty in seeing the common attributes. The strong CMU
loading for Concept Grouping suggests some difficulty with vocabulary
level in that test or some need for precision of meanings. In
order to control for cognition of semantic units in the semantic
classes tests, efforts were made, although not with complete success,
to keep vocabulary level well within the range of ability of all
high-school students.

In Verbal Classification S is to assign words of a list to one of 
two classes or to neither, the classes being defined by two given 
groups of words, the words of each group sharing a common property.
This task could be considered as forming a unique classification
of words in three exclusive classes. From this point of view
the loading on NMC is reasonable. 


The Nonclass Factors 
CMU—Cognition of semantic units
38. Verbal Comprehension (CMU) .68
40. Word Completion (CMU) .62
4. Concept Grouping (NMC) .49 (.37 NMC)


The striking thing about this list is the absence of all except one
of the nonvocabulary, semantic tests, indicating generally good
control of vocabulary level in the semantic tests.


CMS—Cognition of semantic systems 
30. Problem Solving (CMS) .60
34. Ship Destination (CMS) .45
44. Sex .38 (.49 SEX)

No classes tests had significant loadings on factor CMS, nor did
tests for other reference factors. CMS was the only ability
factor on which the variable of Sex had a significant loading. The
positive loading indicates a sex difference in which boys are superior.
The common name for this factor is “general reasoning,”
which has a history of being correlated with sex membership.


DSU—Divergent production of symbolic units
1. Word Fluency (DSU) .66
Suffixes (DSU) .62

No other tests, whether for classes or not, had significant
relationships with DSU.


NMU—Convergent production of semantic units
Picture Group Naming (NMU) .51
Word Group Naming (NMU) .45
Naming Meaningful Trends (NMU) .40

The two leading tests, originally designed as classes tests, persisted
here in going on factor NMU as previously, in spite of the
unusually large number of classes tests in the battery. The naming
(finding the right word or verbal expression) definitely outweighed
cognition of classes as a source of test-score variance. The
same kind of result is true for the trends test, which had been
designed originally for CMR, relations being involved instead of
classes.


SEX
44. Sex .49 (.38 CMS)


As is somewhat common when the tested sample has members
of both sexes involved, a Sex variable is analyzed with the test
variables in order to take care of common sex differences. In this
analysis there proved to be only trivial sex differences except
for factor CMS.


Discussion


The attempt to lend empirical support to the part of the SI
model that is concerned with classes was quite successful. Eleven
of the 20 classes abilities depicted by the SI model were investigated
and identified by this study, two for the first time. It was also the
first time that many of the others had appeared together in the same 
analysis; previous analyses have tended to keep within the same 
operation category, except for incidental inclusion of reference 
factors outside that category. With 36 tests employed to identify 
these 11 factors, 34 of them were found to have loadings consistent 
with their hypothesized content; only two were exceptions. The 
other, reference, factors were identified by their marker tests as 
expected. 


Cognition Factors for Classes 


The three previously established cognition-of-classes factors, 
CFC, CSC, and CMC, were again identified. Nine of the ten tests
hypothesized to measure the three cognition-of-classes factors per- 
formed as expected. Two of the three CFC tests, three of the four 
CSC tests, and two of the three CMC tests were univocal for their 
respective factors. Thus, seven of the ten tests designed for cogni- 
tion of classes were univocal for their respective factors. 

In the description of tests in an earlier section, certain types of 
cognition-of-classes tests were pointed out—exclusion, inclusion, 
matching, and naming. On the whole, we can say that the exclusion 
type of test is not the best for cognition-of-classes abilities. The 
three tests of the inclusion type were all univocal for their respec-
tive factors. Two matching tests and one naming test were success-
ful as measures of cognition abilities. Two other class-naming tests, 
however, were expected from previous experiences to go on a nam- 
ing factor, NMU, and they did so in this analysis. 


Memory Factors for Classes 


Factors MSC and MMC were defined entirely by two tests each, 
as expected. This analysis has confirmed previous findings for the
two memory factors. 


Divergent-Production (DP) Factors for Classes 


The DP factors in this study were defined entirely by the nine 
tests designed for those abilities. Seven of these were found to be 
univocal, and none failed to have significant loadings on their ap- 
propriate DP factors. DFC and DSC had previously been adequately
represented by only two tests each. The addition of three new tests
for these factors has buttressed the evidence for them.

Of the five tests coming under the category of multiple grouping, 
four were univocal for their factors and the fifth just missed being 
univocal. The two tests classified under the description of “regroup- 
ing with a single exemplar outside the list” did rather poorly. The 
two tests under the principle of "shift" scores were strong and 
univocal, as usual. A shift score would seem to be the best for DP- 
of-classes abilities, with multiple-grouping scores not far behind, if 
we may generalize from this limited information. 


Convergent-Production (CP) Factors for Classes 


Ten of the 11 tests designed to measure CP of classes had signifi- 
cant loadings on their respective factors. Of the various categories 
of tests for CP abilities, those designated earlier as simple-partition- 
ing tests seem to have the best record in this analysis. Tests that re- 
quire partitioning and the inclusion of an extra-list target exemplar 
did less well. 

The one CP test for a maximally uneven partitioning of lists ap- 
peared to be a very complex test. The two tests that called for 
partitioning six units by threes in two different ways achieved 
success in one case and complete failure in another. A general 
conclusion that one might draw from all this experience with CP
tests is that the tests that require simpler actions are more likely to
be univocal and strong for their respective factors, a generaliza-
tion that can often be made with respect to tests in other areas of
ability.


Confusions of Contents and Operations 


Although the solution of the factor problem was very clearcut
in this investigation, there is some point in considering the few
instances in which tests were of complexity two; none was of com-
plexity three. As in other analyses of intellectual abilities, the
separation of factors with respect to content was quite easy. In
this study an unusually stringent test of the distinctness of the 
content categories was possible. Many of the classes factors em- 
ployed had identical properties for the kind of content. None of 
these tests has a loading on a factor ordinarily defined by tests with
different contents. In fact, the significant loadings in this analysis
all proved to be for factors in which the content was as expected. 
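
The notion of factorial complexity used here is simply the number of factors on which a test has a significant loading. The following is a minimal sketch of that count, not the authors' procedure. The cutoff of .30 is an assumed value, since this section does not state the criterion for significance; the loadings are those listed above for three tests, with decimal points assumed; the function name is hypothetical.

```python
# Assumed cutoff for a "significant" loading; the section does not state one.
SIGNIFICANT = 0.30

# Loadings as listed in the factor descriptions (decimal points assumed).
loadings = {
    "Verbal Comprehension": {"CMU": 0.68},
    "Concept Grouping": {"CMU": 0.49, "NMC": 0.37},
    "Problem Solving": {"CMS": 0.60},
}


def complexity(test_loadings, cutoff=SIGNIFICANT):
    """Count the factors on which a test loads at or above the cutoff."""
    return sum(1 for value in test_loadings.values() if abs(value) >= cutoff)


for name, row in loadings.items():
    kind = "univocal" if complexity(row) == 1 else "complex"
    print(name, complexity(row), kind)
```

Under this assumed cutoff, a test with one qualifying loading is univocal and a test with two is of complexity two.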

Predicting the right kind of operations for a number of tests 
was another matter, however. The “misses” with respect to kind of 
operation were almost entirely in the nature of confusions between 
cognition and production abilities, more with respect to convergent 
production than divergent production. There were only two tests 
designed for DP abilities that had second loadings on correspond- 
ing cognition abilities and no cognition tests that had second load- 
ings on DP abilities. But there were two CP tests with second load- 
ings on corresponding cognition abilities and three cognition tests 
with significant loadings on CP abilities. There were enough other 
tests that were univocal on both cognition and production factors, 
however, to uphold the general hypothesis of orthogonality be- 
tween the cognition and production categories of abilities. 

Some difficulties with the full separation of tests of the other 
operation categories of memory and evaluation, as well as those of 
DP and CP, from cognition factors have been noted in other studies 
by the Aptitudes Research Project. Such a systematic type of find- 
ing might suggest that cognition is a unique category of abilities; 
that cognition is basic to or is a necessary condition for the other 
kinds of functions. It is easy to see how such a principle could 
apply in connection with the production abilities, for if the in- 
dividual does not have the necessary information at his command, 
he cannot produce certain effects that depend upon that informa- 
tion. Guilford and Hoepfner (1966) have assembled information 
showing that when a wide range of cognition ability exists in a
group, the extent of cognitive ability appears to set upper limits 
upon DP abilities, but the lower limits are about the same for all 
levels of cognitive ability. The same kind of principle could apply 
to convergent production, although the feature of restrictions 
might modify the picture. Another general hypothesis, which could 
also be true, is that tests constructed for noncognitive abilities have 
sometimes failed to rule out individual differences in cognition ex- 
perimentally. As pointed out earlier, for the control of cognition 
variance, the cognitive aspect of the task must be so easy that no 
one would fail by virtue of being weak in cognition abilities. 
Another means of control would be to ensure by selection that all 
Ss had a significantly high status on the relevant cognition ability.


Implications of the Classes Factors


Bruner, Goodnow, and Austin (1956) referred to the learning 
and utilization of classes of objects as one of the most basic and
important mental activities by which man adjusts to his environ-
ment. This point has been underscored by the accelerated interest
in and investigation of concept learning and utilization during the
past decade. However, most of the increased activity has been 
confined to experimental laboratories, while psychometric labora- 
tories have contributed little in this regard. 

Since 1960 a number of studies have attempted to establish re- 
lationships between intellectual abilities and concept learning. Al- 
though some of these (e.g., Allison, 1960; Bunderson, 1965; Lemke,
Klausmeier, and Harris, 1967; Stake, 1961) have demonstrated re- 
lationships between abilities and concept learning, others (e.g., Dun- 
canson, 1964; Manley, 1965) have shown little or no relationship. 
None of these studies have used intellectual abilities dealing with 
classes. Dunham, Guilford, and Hoepfner (1968) have recently 
shown that the structure-of-intellect classes abilities discussed in 
this study have substantial relevance to the investigation of con- 
cept learning. 

Learning and utilizing class concepts appear to be fundamental 
aspects of intellectual functioning. If this is true, then it is un- 
fortunate that tests involving classes have been conspicuous by 
their absence in traditional intelligence scales.


In recent years, resolution of such controversies as content
versus acquiescence data (Berg, 1959; McGee, 1962; Jackson
and Messick, 1961) has oftentimes been blurred by the use of
different scale construction strategies. Though differences may
partially be attributed to the different aspects of the behavior
measured, differences may stem from contrasts in strategies used to
develop scales.

To compare strategies, Hase and Goldberg (1967) used six meth- 
ods to develop sets of 11 scales from the CPI. An attempt was 
made to predict various social criteria by the six strategies. Four 
of the six (factor analytical, empirical, theoretical, and rational) 
strategies were found to be equivalent in predictive effectiveness. 
The remaining two methods, random and stylistic, were substan- 
tially lower in predictive power as judged by the mean multiple 
correlations for the 13 various criteria. On the basis of these data,
strategies may not be as relevant as some authors (Loevinger, 1957) 
have suggested. 

Some contrary evidence is offered by Heilbrun (1962), which
shows that empirically derived scales provide a slightly more valid
scale than rationally derived scales for a measure of affiliative
tendency. However, the correlation between scores on the scales was
.75, which would suggest that they were similar in effectiveness.

Besides the question of strategies used, the increased use of fac-
tor analysis to develop instruments has revealed the question of 
using several factors as contrasted to a few. Peterson (1965) sug- 
gests that a more parsimonious approach may be to reduce the 
size of the instruments to abbreviated forms and thereby avoid the 
tendency to extract several factors. Second and third order factors 
according to Peterson tend to have as much communality of vari- 
ance as those lower order analyses with several factors. 

The criticisms raised by Peterson, which could be formulated as
the “tunnel vision” approach versus the “universal” approach, border
on several issues, one of which is the number of predictors
used and, relatedly, the “costs” (Cronbach and Gleser, 1957) of
adding or subtracting them. To date, the common axiom regarding
the number of items seems to follow the practice of limiting the 
number to the degree that intercorrelations between items are 
minimized and the correlations of items to criteria are maximized 
(Davis, 1952). Adding items presumably would have little effect 
towards decreasing validity unless more predictive variance regard- 
ing a given criterion was associated with the variable added. 

From these considerations and others has emerged a major con-
troversy concerning the nature of the relationship between items
on a scale. Cronbach and others have posited that high internal 
consistency and hence high correlation between items is the most 
favorable characteristic. By contrast, strong theoretical evidence 
(Cattell and Tsujioka, 1964; Comrey, 1961) has been offered for 
the superiority of the heterogeneous strategy for scale construc- 
tion. They cite higher transferability from sample to sample as 
being the primary advantage. The implications here and elsewhere 
in Cattell's writings are that the factor-pure scale is the method to 
be employed to obtain this heterogeneity. No mention is made by 
them of multiple correlation which is the more obvious method of 
getting heterogeneity. 

The authors consider it to be important to study this problem
empirically and have chosen the area of discrimination between
diagnosed psychiatric groups in order to provide control necessary
for testing significance. The problem of separating neurotics from
alcoholics is sufficiently difficult to provide the "top" necessary to
obtain variability in measures of discriminatory efficiency. It was
hoped that the results from such a venture would give evidence
concerning the best strategy for selecting items in test construction.


Method 


Items from Cattell’s 16 Personality Factor Test (16 PF) were
used to discriminate alcoholics from neurotics. One initial pool of
items was selected by choosing those most highly correlated with
the alcoholism-neuroticism distinction. A somewhat different pool
was obtained by using frequencies of response for the selection of
items. Refinements were made on each pool by applying factor
analysis, stepwise regression, and ranking on the original selection
criterion to obtain optimal scales of 20 items each. In spite of
the similarity in purpose, these had somewhat different
constituency.


Subjects 


The 16 PF was administered to 114 males and 74 females ranging
in age from 20 to 60, all of whom had either visited an outpatient
clinic or had been admitted to a psychiatric hospital for the
treatment of alcoholism. For comparison, the 16 PF was also given
to another group of 32 males and 96 females who had either
made a visit to an outpatient clinic or had been admitted to the
psychiatric hospital for neurotic symptoms. The alcoholic and
neurotic samples were each divided into validating and cross-validating
groups with the same sex ratio being maintained. Thus, the two
male subsamples each consisted of 57 male alcoholics and 16 male
neurotics, and the two female subsamples each consisted of 37 female
alcoholics and 45 female neurotics.


Design 


A correlation matrix was calculated for each sex among the 187
items of the 16 PF and an additional variable indicating the nature 
of the disorder. Those items which had a positive or negative co- 
efficient greater than .150 with the disorder were selected for the 
first two item pools. 
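
The correlational screening just described can be sketched as follows. This is an illustrative outline only, not the authors' program. It assumes item responses are scored numerically and the disorder is coded 1 for alcoholic and 0 for neurotic (the coding is not given in the text); the function names and the toy data are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation; returns 0.0 when either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def select_by_correlation(items, disorder, cutoff=0.150):
    """Keep items whose correlation with the disorder exceeds the cutoff
    in either the positive or the negative direction."""
    return [name for name, scores in items.items()
            if abs(pearson_r(scores, disorder)) > cutoff]


# Toy data: three hypothetical items answered by six persons.
disorder = [1, 1, 1, 0, 0, 0]
items = {
    "q1": [2, 2, 1, 0, 0, 1],   # higher among alcoholics
    "q2": [1, 0, 2, 1, 0, 2],   # unrelated to the disorder
    "q3": [0, 0, 1, 2, 2, 1],   # higher among neurotics
}
print(select_by_correlation(items, disorder))
```

Using the absolute value of the coefficient means that items keyed in either direction enter the pool, which matches the "positive or negative" wording above.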

An item pool for each sex was also selected using the frequencies
in each of the three responses to any given item. This became a
basis for assessing the comparative differences in the response
behavior of the alcoholic and neurotic groups. If the difference in
frequency for a given category of responses was greater than plus
or minus six, then that item was included in these pools.
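
A rough sketch of the frequency screening follows. The text does not say whether the difference of six refers to raw counts or to percentages; this sketch assumes percentages so that groups of unequal size are comparable, and that choice is an assumption. The response labels "a", "b", "c" and the function names are hypothetical.

```python
from collections import Counter

CATEGORIES = ("a", "b", "c")  # the three response options of an item (labels assumed)


def percent_frequencies(responses):
    """Percentage of respondents choosing each response category."""
    counts = Counter(responses)
    return {c: 100.0 * counts[c] / len(responses) for c in CATEGORIES}


def select_by_frequency(alcoholic, neurotic, cutoff=6.0):
    """Keep items on which the two groups differ by more than the cutoff
    in at least one response category."""
    selected = []
    for item in alcoholic:
        pa = percent_frequencies(alcoholic[item])
        pn = percent_frequencies(neurotic[item])
        if any(abs(pa[c] - pn[c]) > cutoff for c in CATEGORIES):
            selected.append(item)
    return selected


# Toy data: two hypothetical items, ten respondents per group.
alcoholic = {"q1": list("aaaaaaabbc"), "q2": list("aabbc") * 2}
neurotic = {"q1": list("aaabbbcccc"), "q2": list("aabbc") * 2}
print(select_by_frequency(alcoholic, neurotic))
```

Because each category is compared separately, an item can qualify through the middle response alone, which a linear correlation would tend to miss.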


The four pools of items selected by these criteria were identified
as follows: (1) Male Frequency = 57 items; (2) Female Frequency
= 73 items; (3) Male Correlation = 40 items; and (4)
Female Correlation = 51 items. The pools were further refined using
methods of factor analysis, stepwise regression, and rankings on
the selection criteria.

The factor analytic selection of items was from those items
having .350 loadings on the factors in which the alcoholism variable
had its principal loadings. The order of selection favored those
factors with the highest alcoholism loadings first. In cases where a
given item loaded on more than one alcoholism factor, it was still
represented only once in the finished 20 item scale.

The stepwise regression method selected the 20 items which
most optimally explained the variance in the alcoholism variable.
In some cases, this did not correspond to the first 20 items added
to the prediction equation. Since 20 item scales were the goal, F
ratios were not used as a criterion for terminating the stepwise
analyses.
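
As an illustration of the general idea, the sketch below performs a simple greedy forward selection: at each step it adds the item that most increases the explained variance of the criterion. It is a stand-in under stated assumptions, not the stepwise program the authors used; it has no removal step and no F-to-enter test, and the function names and toy data are hypothetical.

```python
def centered(values):
    """Deviations of the values from their mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def forward_select(items, criterion, k):
    """Greedy forward selection of up to k items.

    At each step the item giving the largest increase in the explained
    sum of squares of the criterion is added; its contribution is then
    partialled out of the criterion and of every remaining item.
    """
    resid_y = centered(criterion)
    resid_x = {name: centered(scores) for name, scores in items.items()}
    chosen = []
    while len(chosen) < k and resid_x:
        gains = {}
        for name, x in resid_x.items():
            ss = dot(x, x)
            gains[name] = dot(x, resid_y) ** 2 / ss if ss > 1e-12 else 0.0
        best = max(gains, key=gains.get)
        if gains[best] <= 1e-12:
            break  # no remaining item adds explained variance
        xb = resid_x.pop(best)
        ssb = dot(xb, xb)
        weight = dot(xb, resid_y) / ssb
        resid_y = [y - weight * b for y, b in zip(resid_y, xb)]
        remaining = {}
        for name, x in resid_x.items():
            slope = dot(xb, x) / ssb
            remaining[name] = [a - slope * b for a, b in zip(x, xb)]
        resid_x = remaining
        chosen.append(best)
    return chosen


# Toy data: three hypothetical items and a criterion for six persons.
toy_items = {
    "a": [1, 2, 3, 4, 5, 7],
    "b": [1, 0, 1, 0, 1, 0],
    "c": [3, 1, 4, 1, 5, 9],
}
print(forward_select(toy_items, [1, 2, 3, 4, 5, 6], k=2))
```

Because each choice is tuned to the sample at hand, a procedure of this kind can fit sample-specific variance, which is relevant to the cross-validation results reported later.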

Selection of items by ranking was applied to the original pools. 
Items from the correlation pools were ranked by magnitude of 
coefficients, and the first 20 items in descending magnitude were 
selected. Items for the two frequency pools were ranked in a similar 
manner. Those items with the higher absolute differences in re-
sponses were selected. Again, the first 20 items from the pool were
used in the final validation analyses. 
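
Selection by ranking amounts to sorting the pool on the absolute size of the selection index and keeping the leading items. A minimal sketch follows, assuming the index (correlation coefficient or frequency difference) is already available per item; the function names are hypothetical, and the five scale lengths are those used in the study.

```python
LENGTHS = (6, 8, 10, 15, 20)  # scale lengths examined in the study


def rank_items(index_values, k=20):
    """Order items by descending absolute magnitude of the selection index
    and keep the first k."""
    ordered = sorted(index_values,
                     key=lambda name: abs(index_values[name]),
                     reverse=True)
    return ordered[:k]


def nested_scales(ranked, lengths=LENGTHS):
    """Shorter scales are prefixes of the ranked list, so each scale
    contains every item of the scales shorter than it."""
    return {n: ranked[:n] for n in lengths if n <= len(ranked)}


print(rank_items({"a": 0.2, "b": -0.5, "c": 0.3}, k=2))
```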

In this way, a total of 12 scales were developed from all pools
and were labeled as follows: (1) Female-Correlation-Ranking
(FCR); (2) Female-Correlation-Factor Analysis (FCF); (3) Female-
Correlation-Stepwise Regression (FCS); (4) Male-Correlation-
Ranking (MCR); (5) Male-Correlation-Factor Analysis (MCF);
(6) Male-Correlation-Stepwise Regression (MCS); (7) Female-
Frequency-Ranking (FFR); (8) Female-Frequency-Factor Analysis
(FFF); (9) Female-Frequency-Stepwise Regression (FFS);
(10) Male-Frequency-Ranking (MFR); (11) Male-Frequency-Factor
Analysis (MFF); and (12) Male-Frequency-Stepwise Regression
(MFS). All three selection methods were comparable in that
the first item selected was by that method the most predictive of
alcoholics and the subsequent items were progressively less
predictive.

In order to determine if length of scale is an interacting factor
in the choice of a strategy to be used in the selection of items,
length became a parameter in this study. Each of the 12 scales was
tested with varying lengths including 6, 8, 10, 15, and 20 items.
It was thus possible to analyze respective efficiency at any
length and also to test the characteristic of the scales as the size was
increased.

The 12 scales, each with the five varying lengths, were initially
tested against the original samples for validation purposes. To test
transferability of the scales obtained, the same items were applied
to the remaining halves of the subject pools, thus providing
cross-validational data on the scales.


Results 


Discriminant functions (Rao, 1952; Dixon, 1967) were used to
test effective discrimination for each of the 12 scales for the
validation and cross-validation samples. Discriminant functions
were computed for the groups of 6, 8, 10, 15, and 20 item counts for
each of the 12 scales. The 60 analyses, represented in Table 1, give
the accompanying F-ratios for each of the 12 scales across the
varying number of items included. The original sample from
which the scales were constructed was discriminated at a significant
level by nearly all methods applied. However, for cross-validation
samples, the disparity in effectiveness between methods became
more obvious. With the exception of the factor analyzed items
from the male correlational pool (MCF), most F-ratios of the
factored pools were significant only at the .05 level. The other
methods yielded significance only for isolated occurrences. The
selection of items with stepwise regression seems to provide highly
significant F-ratios when limited to the population on which the
items were selected. However, there was a noted decrease of ratios
for cross-validation samples, indicating low transferability of this
method.

Subjects were ranked and classified as either alcoholic or neurotic
by discriminant functions for each scale at varying lengths.
Misclassifications, false positives and false negatives, were tallied for
all methods and item levels using both original and cross-validation
samples. A randomized block factorial design of analysis of variance
was used to test differences in misclassifications for methods of
item selection and varying scale length. Independent analyses were
conducted for each sex. For both males and females (Tables 2 and
3), the number of items has a significant effect at the .05 level.
Methods were significantly different for males at the .05 level;
however, methods of selection for females only approached the .05
level. Interactions between number of items and methods were not
significant for either males or females.


TABLE 2

Analysis of Variance for Misclassifications on
Discriminant Analyses for Females

Source               df        SS        MS       F
A (No. of Items)      4    538.17    134.54    7.30*
B (Methods)           5    406.14     81.23    4.41
AB                   20    432.03     21.60    1.17
Blocks                1    149.07
Residual             29    349.93     18.42
Total                59  2,475.34

* Significant at the .05 level.
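
The F-ratios in Table 2 can be checked directly, since each is the mean square for an effect divided by the residual mean square. The short sketch below uses only the mean squares printed in Table 2 for females; the function and variable names are hypothetical.

```python
# Mean squares as printed in Table 2 (females).
MS_RESIDUAL = 18.42
mean_squares = {
    "A (No. of Items)": 134.54,
    "B (Methods)": 81.23,
    "AB": 21.60,
}


def f_ratio(ms_effect, ms_error):
    """F is the effect mean square divided by the error mean square."""
    return ms_effect / ms_error


f_values = {source: round(f_ratio(ms, MS_RESIDUAL), 2)
            for source, ms in mean_squares.items()}
print(f_values)
```

The rounded ratios agree with the tabled values of 7.30, 4.41, and 1.17.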

Misclassifications for the 20 item scales (Figure 1) were plotted 
for each method. Ranking for both correlation and frequency meth-
ods produced more misclassifications, while the factor analysis
and stepwise regression on a frequency pool had fewer misclassifi- 
cations. The graph illustrates that stepwise regression techniques for 
both correlation and frequency pools produced fewer errors when 
classifying the original population. However, there was a sharp 
increase in misclassifications when this method was applied to the 
cross-validation samples. 
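
The tallying of misclassifications can be illustrated as below. This is a sketch under assumptions: a person is classified as alcoholic when the discriminant score exceeds a cutoff (taken here as zero), and "positive" means classified as alcoholic. Neither convention is stated in the text, and the function name and scores are hypothetical.

```python
def tally_misclassifications(scores, is_alcoholic, cutoff=0.0):
    """Count false positives (neurotics classified as alcoholic) and
    false negatives (alcoholics classified as neurotic)."""
    false_pos = 0
    false_neg = 0
    for score, actual in zip(scores, is_alcoholic):
        predicted = score > cutoff
        if predicted and not actual:
            false_pos += 1
        elif actual and not predicted:
            false_neg += 1
    return {
        "false_positives": false_pos,
        "false_negatives": false_neg,
        "total": false_pos + false_neg,
    }


# Hypothetical discriminant scores for three alcoholics and three neurotics.
scores = [1.2, 0.4, -0.3, -1.0, 0.8, -0.2]
actual = [True, True, True, False, False, False]
print(tally_misclassifications(scores, actual))
```

Running the same tally on the original and on the cross-validation halves gives the comparison described above.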

The Mahalanobis Dsquare statistic (Table 4) for each discrimi- 
nation similarly demonstrated that the distances between alcoholics 


TABLE 3

Analysis of Variance for Misclassifications on
Discriminant Analyses for Males

Source               df        SS        MS       F
A (No. of Items)      4    125.10     49.27    8.22*
B (Methods)           5    210.60     42.12    7.04*
AB                   20     57.90       .70     .12
Blocks                1    614.40
Residual             29    173.60      5.99
Total                59  1,181.60

* Significant at the .05 level.


Figure 1. Comparative efficiency of 20 item scales for avoiding misclassification
in discriminating alcoholics from neurotics.


TABLE 4


Mahalanobis Dsquares between Neurotics and Alcoholics for Original and
Cross-Validation Samples Varying the Number of Items


Construction
Methods O C O C O C O C O C
MOR 2.666 1.816 3.203 2 109 84 2. 
; à ` 3.643 2.123 4.400 2.424 5.1 

Mog 1.092 0.868 3.41 1.014 4.115 1.07 8300 1.404 10.280 a4 
MFR 21608 Q1. 8.797 2.544 11.588 2.788 15.104 3.305 17.353 5 
MEF 4/730 ue 3.641 0.880 4.052 1.438 5.920 1.979 0.749 23 
MrS 6.092 1:397 5.999 2.805 6.681 2.901 7.090 3.908 7.488 7 
FCR 1:906 dS 7.473 1.482 8.645 1.784 11.349 2.672 14.183 $ 
FCF 1.879 0o 2.001 2.304 2.575 2.385 3.107 2.931 429 7 

FOS 1.750 (422 1.047 1.321 2.187 1.439 3.623 2.208 4.208 í 

FRR 1425 (199 2.596 0.269 3.455 0.611 5.016 1.105 6.90 5 

m 219 0.416 2.141 0.779 2.874 1.343 4.294 1.500 4-401 7 

. FPES 3:500 1:205 8.122 1.085 3.492 2.150 4.535 2.965 5.97» 7

and neurotics are greatest for the refinements of stepwise regres-
sion and factor analysis on frequency pools. The distance increases
as items are added to a scale for all methods. As previously demon-
strated by the number of misclassifications, the selection of items 
by ranking alone showed least separability as reflected by Dsquares 
for both original and cross-validation samples. 
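
For readers unfamiliar with the statistic, the Mahalanobis Dsquare is the squared distance between two group centroids measured in units of the pooled within-group covariance. The sketch below works the two-variable case only, to keep the matrix inverse explicit; the scales in the study had 6 to 20 items. It is an illustration, not the authors' computation, and the function name and data are hypothetical.

```python
def mahalanobis_d2(group1, group2):
    """Dsquare between two groups of (x, y) observations, using the
    pooled within-group covariance matrix."""
    def centroid(group):
        n = len(group)
        return (sum(p[0] for p in group) / n, sum(p[1] for p in group) / n)

    def sums_of_products(group, mean):
        sxx = sum((p[0] - mean[0]) ** 2 for p in group)
        syy = sum((p[1] - mean[1]) ** 2 for p in group)
        sxy = sum((p[0] - mean[0]) * (p[1] - mean[1]) for p in group)
        return sxx, syy, sxy

    m1, m2 = centroid(group1), centroid(group2)
    a = sums_of_products(group1, m1)
    b = sums_of_products(group2, m2)
    df_within = len(group1) + len(group2) - 2
    sxx, syy, sxy = ((u + v) / df_within for u, v in zip(a, b))
    det = sxx * syy - sxy ** 2
    dx, dy = m1[0] - m2[0], m1[1] - m2[1]
    # d' S^-1 d, with the 2 x 2 inverse written out
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det


# Hypothetical two-item scores for two groups of four persons.
g1 = [(0, 0), (2, 0), (0, 2), (2, 2)]
g2 = [(4, 4), (6, 4), (4, 6), (6, 6)]
print(mahalanobis_d2(g1, g2))
```

A larger value indicates greater separation of the two groups relative to the spread within them.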

Spearman Rank correlations (Table 5) were computed between 
discriminant function indices of original and cross-validation 
samples. The purpose was to determine whether the discriminant 
power of each item of the scales maintained its respective rankings 
for the cross-validation samples. As indicated in Table 5, only the 
male factor analytic scale obtained from the correlation pool re- 
flected any significant stability. 
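
The rank correlation used for this stability check can be sketched with the familiar difference formula, rho = 1 - 6(sum of d squared) / (n(n squared - 1)). The version below assumes no tied values, which is an assumption of this sketch and not something stated in the text; the function names and data are hypothetical.

```python
def ranks(values):
    """Rank positions (1 = smallest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman rank correlation by the difference formula (no ties)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical item weights in the original and cross-validation samples.
original = [0.9, 0.7, 0.4, 0.2, 0.1]
cross = [0.8, 0.5, 0.6, 0.3, 0.1]
print(spearman_rho(original, cross))
```

A coefficient near 1 would mean that items keep their relative discriminating power from one sample to the other.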


Discussion 

Generally, the study suggests that using initial selections based
on frequency of response differences followed by refinement
techniques of stepwise regression and factor analysis offers some
advantages. The significance levels for the discriminant function
analyses were beyond the .01 level for the MFF, FFF, and FFS scales.
These scales maintained a .05 level of discrimination on the
cross-validation samples. The same trend was found with
misclassifications of the analyses. Differences in scales were significant at the
.05 level for males and approached significance for females.

Graphing of the 20 item scales shows that the frequency methods
refined by factor analysis and stepwise regression produced fewer
misclassifications. On the original samples, separation of alcoholics
and neurotics was very good, with perfect discrimination for one
scale, the stepwise regression scale from the male correlational
pool. However, the number of misclassifications sharply increases
for stepwise scales on cross-validation samples as compared to other
scales. The increase would suggest that stepwise regression scales
tend to capitalize on error specific to a given group. The same
trend is found with discriminant functions; inflated F-ratios were
found for the original samples and contrasting decreases were
observed for cross-validation samples.

It did appear that factor item selection as performed in the study
may offer certain advantages of generality as contrasted to stepwise
regression. As noted earlier, selection by stepwise regression seems
to capitalize on sample specific variance. Decrease in generality
may indicate spurious heterogeneity for stepwise regression. Valid
heterogeneity was evident with factor analysis, since no less than
four orthogonal factors were used to select items. Thus, the
arguments of Loevinger and Cattell for greater transferability of
heterogeneous sampling devices such as those constructed by factor
analysis seem supported. The techniques of stepwise regression may
be restricted to refinement of existing scales after standardization
procedures. Such a use would have interesting implications towards
constructing scoring keys with variable item weightings.

The slight advantage of refining pools of items based on fre- 
quency of responses may be specific to an interaction between the 
populations tested and to the nature of the instrument used in the 
study, the 16 PF. Neurotics may have responded to the middle or 
neutral response more frequently than alcoholics. Subsequently, 
the relationship between the neuroticism variable and a scale based 
on the three possible responses for any item could have been 
curvilinear. The lack of a linear regression would thereby render
the correlation method inferior to the frequency method, which is more
sensitive to this curvilinearity. It is also possible that the larger
pool obtained for frequency methods might have allowed more
optimal selection of items.

The somewhat unexpected decrease in F-ratios of the discriminant
analyses with a corresponding increase in the number of items
may only indicate that the “error” term becomes inflated when
items are increased beyond an optimal point. The increase of the
error term seems to covary with the expectancies of F with
corresponding changes in degrees of freedom. Thus, adding items does
not seem to seriously decrease the validity of the instrument; in
fact, fewer misclassifications did occur with the inclusion of more
items. Mahalanobis Dsquares for each total of items also give the
impression that the distinguishability between groups is increased
with additional items.

In conclusion, techniques do affect the ostensible validity of an
instrument. Stepwise regression offers advantages in concurrent
discrimination but loses strength in cross-validations. Factor analysis
seems to offer more generality, and subsequently more real
predictive validity, particularly if more than one factor becomes the
basis for item selection. Ranking methods seemingly lack generality
of prediction, suggesting that homogeneity is detrimental. The
addition of items to a scale appears to produce more discriminability
between groups.

More comparisons using larger samples of specific populations 
could clarify more precisely the types of construction which offer 
the greatest stability and discrimination. A more thorough study
should be made of the differences between factor-pure scales and the
factor-complex scales used in this study. The latter seem to be much
more appropriate for the complex discrimination asked for in most 
“type” distinctions. This provides further evidence that heterogene- 
ity in scale construction increases transferability from sample to 
sample. The use of several factors for item selection for a given 
scale seems to be more effective in providing meaningful heterogene- 
ity than does stepwise regression which incorporates more spurious 
specificity in its item selections. 


Summary 


Inductive methods of correlation and frequency discrimination 
were used to collect two pools of items to discriminate alcoholics 
from neurotics. The items, which were taken from the 16 PF, were
then reduced according to their effectiveness of prediction to the
20 most predictive by each of three methods: factor analysis,
stepwise regression, and ranking. Twelve scales were then con-
trasted by their ability to discriminate between alcoholics and
neurotics of an original and a cross-validation sample.

Using a discriminant function analysis to test the effectiveness of
prediction, it was found that frequency pools with refinements of
factor analysis and stepwise regression offered the greatest
advantages. Factored scales appear to offer more generality than those
obtained by stepwise regression, but both were better than the
homogeneous scales derived from the highest ranked items on the
criterion. This provides further evidence that heterogeneity in
scale construction increases transferability from sample to sample.
The use of several factors for item selection for a given scale seems
to be more effective in providing meaningful heterogeneity than
does stepwise regression, which seems to incorporate spurious
specificity in its initial success.


THE EFFECTS OF PRETEST COMMITMENT AND
INFORMATION UPON OPINION CHANGE


FREDERICK J. PAULING
John Jay College
AND
ROBERT E. LANA
Temple University


It now seems evident that, under conditions where Ss are pre-
sented sequential opposed communications, the prior administra- 
tion of a pretest questionnaire will inhibit the amount of opinion 
change produced by the communication (Lana, 1966). This inhibi- 
tion has been found to operate with equal facility on change toward 
or away from either of two opposed arguments. In previous studies 
(Lana and Rosnow, 1963; Lana, 1964; and Lana, 1966), the awareness
of the S's taking the pretest was manipulated prior to the
exposure of all Ss to the two-sided persuasive communications:
One group completed a pretest opinion questionnaire relevant to
the subject matter of the communications; a second group took
the same pretest but under conditions which made it appear to be
part of another test situation, presumably unrelated to the
experimental situation; a third group, the control group, was not given
a pretest in any form. On the basis of the consistent finding that
differential opinion change is directly related to the level of
pretest-taking awareness, Lana (1966) has suggested that the basic
dimension involved may be the degree of external commitment to an
opinion prior to exposure to the communication. According to this
proposition, the taking of a typical pretest would represent a form
of public commitment since the S's responses are identified with
his signature and recorded by the experimenter. Hence, one would
expect less public commitment under conditions where the S
responds to opinion questions which are “hidden” in another task
(such as a classroom exam, Lana and Rosnow, 1963) since the S
would be uncertain as to the relevance of his responses. Finally,
when no pretest is given, there can be no initial commitment by
the subject.

The current research report presents the results of three syste- 
matically related studies. Experiment I and its replication (Experi- 
ment II) have a two-fold objective: to achieve greater precision in 
the experimental specification of different levels of pretest com- 
mitment, and to test the adequacy of Lana’s conceptualization re- 
garding the inhibitory effect of pretest-taking on opinion change. 
In the third study the differential effects of private vs. public pre- 
test commitment under equivalent experimental conditions were 
examined. The assumption here is that responding privately to a
pretest opinion questionnaire without public exposure of one’s opin- 
ion position should produce a weaker pretest commitment than 
responding to the questionnaire in the conventional manner. There- 
fore, a private commitment condition should produce greater opin- 
ion change than a public commitment condition. Since Experiment 
III duplicated the essential conditions involved in Experiments I 
and II, it can be considered, in part, a further replication. 

The principal hypothesis of all three studies is that the amount 
of opinion change, under conditions involving a two-sided presen- 
tation of persuasive communications, is an inverse function of the 
degree of pretest commitment the subject identifies with his initial
opinion. As a result of Studies I and II, it was necessary to
modify this position, and this modification is explained below.

The degree of pretest commitment can be experimentally manipulated
by varying the pretest task instructions. Hence, taking a pretest in
the usual manner, which identifies the S's opinion position, would
presumably yield maximum pretest commitment and consequently
greatest inhibition of opinion change. However, instructing the S to
read the pretest under some pretext and to refrain from indicating
his opinion should yield less pretest commitment and thus less
inhibition of consequent opinion change.


Experiments I and II 
Subjects


The Ss in Experiment I were 69 undergraduates enrolled in
an introductory psychology course at Alfred University. Those
in Experiment II were 62 Alfred University undergraduates
enrolled in introductory psychology and social psychology. There
were roughly equal numbers of males and females in both cases.


Materials 


The communications employed in Experiment I consisted of a 
pro and a con argument on racial integration. The pro and con 
arguments used in Experiment II were concerned with the use of 
nuclear weapons. Each of the four arguments used in both studies 
was approximately 425 words in length. In each study, the pro-con 
communications were based on five points of argument for and 
five points of argument against specific proposals relating to the 
issue. Both sets of communications were written by the second 
author. These arguments had been previously submitted to a 
group of students for evaluation of the equivalence of the pro and 
con arguments in terms of stylistic factors and overall persuasive
effectiveness. The results (Lana, 1962) indicated that pro
communications were perceived as similar in style and effectiveness
to their respective con communications.

The opinion questionnaires employed in both studies consisted 
of five Likert-type items for each of the topics. Both of the ques- 
tionnaires, on racial integration (Experiment I) and the use of 
nuclear weapons (Experiment II), were previously developed by 
Lana (1962). The range of possible scores on the two
questionnaires was five at the lower limit and 26 and 24, respectively, at
the upper limit. A low score represents support of the con argu- 
ment, and a high score represents support of the pro argument. 


Procedure 


Experiment I. The Ss were randomly divided into six groups on 
the basis of two orders of presentation and three pretest question- 
naire conditions. 

The first two groups received an initial opinion questionnaire 
on the topic of racial integration to determine their pretest opinions 
on the issue. Three days later, one group was read the pro racial
integration argument immediately followed by the relevant con
argument (Pro-Con), and a second group was presented the same
arguments but in reversed order (Con-Pro). The next two groups
were given the same pretest opinion questionnaire but were
instructed to read the items carefully and refrain from checking
their opinion responses. They were led to believe that their pretest
task was to evaluate the overall clarity and grammatical correctness
of the questionnaire items and were provided with a rating scale
for this purpose. Hence, these Ss were exposed to the same
questionnaire material as the first two groups but were not required
to publicly commit their opinion position on the issue. One of
these two read-pretest groups received the pro-con order of
argument presentation and the other group the reversed order. The
final two groups did not receive any pretest questionnaire
material or instructions, thus serving as controls for the pretest
manipulations. Again, there was a breakdown on the basis of order of
presentation into pro-con and con-pro groups. Immediately after 
the presentation of the opposed arguments by the same communi- 
cator, all groups were administered a posttest opinion questionnaire 
which was identical to the pretest questionnaire. All Ss were in- 
structed to fill out the questionnaire in the conventional manner 
by checking their opinion responses. The dependent variable of 
opinion change was the absolute difference between the pretest 
and posttest opinion scores of all groups. 

The pretest mean scores for those groups that did not fill out 
the pretest (the read pretest and no pretest groups) were esti- 
mated from the pretest mean scores of the pretested groups. This 
pretest estimation procedure appears justified on the basis of the 
high level of consistency among the pretest means of other college 
samples involved in previous opinion change research employing 
the same opinion questionnaire (racial integration). See Lana
(1966) for details.

Experiment II. This study was an attempt to replicate
Experiment I systematically by examining the effects of introducing a
different topic of communication—namely, the use of nuclear weap- 
ons. All other aspects of Experiment II exactly replicated the 
conditions of Experiment I. A summary of the research design 
for Experiments I and II follows: 


TABLE 1
Design of Experiments I and II

      Pretest              Read Pretest            No Pretest

                            Three Days

 Pro-Con   Con-Pro      Pro-Con   Con-Pro      Pro-Con   Con-Pro

                             Posttest


Results


The results are given in Table 3. The order predicted from the
commitment hypothesis should have yielded the least opinion
change in the pretest only group, the most in the no pretest group.
The read pretest group should have yielded an amount of change
between these two groups. The actual order of experimental change
did indeed show that the pretested group changed the least, but
the other two groups were reversed in their effects. The read-pretest
group changed the most, and the no-pretest group was
second. In attempting to explain this reversal, it occurred to the
authors that another factor, apparently present in the experimental
conditions, may have been operative. This factor was the amount
of information available to the subject in the form of the content
of the questions in the pretest to which he was exposed before
presentation of the two opposed communications. This information
may sensitize the S to later presentations of materials and
result in responses different from those of Ss not subjected to a
pretest. In the past (Lana, 1961; Lana and Rosnow, 1963; Rosnow
and Lana, 1965), amount of information (familiarity) available to
the subject was shown to influence amount of opinion change.
Comparing the three experimental groups indicates that a combination
of the pretreatment commitment with familiarity would, in
an ex-post-facto manner, explain the results. The largest opinion
change was evident in the group that read the entire pretest but
did not fill it in. This group thus made no commitment, but
received the same pretreatment information as the Ss in the pretest
group who were asked to fill in the questionnaire and hence
record publicly their opinions. The Ss in the no-pretest group were
asked for no pretreatment commitment, but also received no
information. This group would be expected to change less than the
information, no commitment group, which is precisely what
did occur. Apparently, commitment is a more powerful inhibitor
than information is facilitatory of opinion change, since the
no-pretest group changed more than the pretest only group. (In the
pretest only group, apparently, high commitment yields greater
inhibition of opinion change than high information is facilitatory of
opinion change.)


658 EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 


However, this is an ex-post-facto analysis. Further tests were 
needed to examine the modified commitment hypothesis. 


Experiment III 


The Ss used in Experiment III were 104 undergraduates enrolled 
in the introductory course in psychology at Long Island University. 
The numbers of males and females in the sample were about equal. 

Two groups were designed to represent experimental conditions 
which would presumably yield a magnitude of opinion change be- 
tween the read pretest and no-pretest groups and between the no- 
pretest and pretest only groups of Experiments I and II. 


Materials 


The topic of the use of nuclear weapons in warfare was used 
as the content of the persuasive messages, one of which argued in 
favor of their use (pro) and the other of which argued against 
their use (con). These arguments, along with the five-statement 
Likert-type questionnaires, were exactly those used in Experiment II.


Procedure 


The subjects were divided randomly into five groups. Groups 
1, 3, and 5 duplicated the pretest only, read-pretest and no-pretest 
groups of Experiments I and II and thus provided a replication of 
these studies. Groups 2 and 4 were designed to provide degrees of 
commitment between the pretest only (high commitment) group 
and the no-pretest (low commitment) groups while providing 
more information to the subject than the no-pretest group
without entailing as much commitment as in the pretest only group.

That is, Group 2 was instructed to read the opinion questionnaire
statements but not the Likert-type opinion response alternatives
attached to each question, which were left off the questionnaire.
The idea was to introduce the information about the topic
contained in the statement without suggesting, by the presence of
the Likert alternatives, that a commitment would be asked of the
subject (read-question-body group). Group 4 was asked to read
the complete pretest questionnaire (Likert alternatives included)
and to covertly, i.e., without writing it down, choose the various
alternatives that reflected their beliefs. This is labeled the
private-commitment group. All five groups were then subdivided
randomly into the usual pro-con and con-pro order groups. Pretests
were administered to the pretest group (Group 1). One week later,
all groups received the communications and immediately afterward
were asked to fill out the posttest opinion questionnaire. The
prediction following from the above discussion is that the amount of
opinion change in either (pro or con) direction would be greatest
in the read pretest group (Group 3) followed, in descending order
of amount of change, by the read-question-body group (Group 2),
the no-pretest group (Group 5), the private-commitment group
(Group 4) and the pretest only group (Group 1). The read-pretest
group is provided the same amount of information (sensitization)
as any of the other groups who are presented the complete
pretest, but less commitment is expected of them and thus,
following from the revised hypothesis of this study, they should change
opinion more than any of the other experimental groups. The
pretest-only group receives as much information as the other
groups, but the commitment requirement is strongest. Thus it
should change the least since commitment is more inhibiting than
information is facilitatory of opinion change. The read-question-body
group receives slightly less information than those groups
who see the entire pretest, and they are not made aware of response
alternatives, which presumably reduces any tendency
toward commitment. The no-pretest group receives no information
and makes no commitment, so it acts as a control group and is
placed in the middle of the expected magnitude of opinion change
continuum. The private-commitment subjects receive exactly the
same conditions as the pretest-only group except that they are
instructed to make their commitment privately (without writing
it down) and hence they are placed above the pretest-only
group for expected magnitude of opinion change. The design is
summarized in Table 2.


TABLE 2
Research Design, Experiment III

Group        1               2                3               4               5
          Pretest      Read Question     Read Pretest    Private Com.     No Pretest
                          Bodies

                                       One Week

      Pro-Con Con-Pro    P-C   C-P        P-C   C-P       P-C   C-P       P-C   C-P

                                       Posttest


Results


The results of all three studies were combined across experimental
conditions, and t-tests for independent means were carried
out for each of the absolute-mean difference scores between pretest
and posttest for all experimental conditions. Error terms for
the directly pretested groups were calculated as the variance of the
pre-post difference scores. For those groups whose pretest score
is an estimated constant, error terms were considered equal to the
variance of the posttest scores about the posttest mean plus the
square of the pre-post difference. The mean absolute change scores
are reported in Table 3. The predictions were confirmed in terms
of the order of the combined absolute mean change scores, but
differences between these scores were not all significant. Table 4
summarizes the results of 2-tailed t-tests for independent means
performed on the various combined absolute mean change scores.
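
The error term described for groups with an estimated constant pretest score can be read as the mean squared deviation of the posttest scores about that constant. The following is a minimal Python sketch of that reading, not the authors' computation: it assumes a population variance (divisor n, which the text does not specify), and the function names and scores are hypothetical.

```python
from statistics import fmean, pvariance


def error_term_estimated_pretest(posttest, pretest_constant):
    """Posttest variance about its own mean plus the squared pre-post difference."""
    mean_post = fmean(posttest)
    shift = mean_post - pretest_constant  # pre-post difference of the group means
    return pvariance(posttest, mu=mean_post) + shift ** 2


def mean_square_about_constant(posttest, pretest_constant):
    """Mean squared deviation of the posttest scores about the pretest constant."""
    return fmean([(score - pretest_constant) ** 2 for score in posttest])


# Hypothetical posttest scores and estimated pretest mean (not data from the study).
scores = [12, 15, 9, 18]
estimated_pretest = 14
term = error_term_estimated_pretest(scores, estimated_pretest)
check = mean_square_about_constant(scores, estimated_pretest)
```

Under the divisor-n assumption the two quantities coincide, which is why adding the squared pre-post difference to the posttest variance yields a usable error term when no individual pretest scores exist.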

The variances of the absolute mean difference scores were ex- 
amined against one another and none of the ten F-ratios were 
significant at the .05 level. Clearly the experimental treatments 
had no effect on variability from pre- to posttest. Variances were 
computed as the variance of the posttest scores plus the square of 
the pre-post difference. 


Discussion and Conclusions 


Although not all differences are statistically significant, the order
of change of the various pre-condition groups was as predicted.
The results of all three experiments affirm the observation (Lana,
1966) that pretested groups, by and large, change significantly
less than unpretested groups. All studies indicated that commitment
by the S, as measured by degree of public statement of opinion,
interacted with amount of information available in the pretest to
produce greatest change of opinion in groups exposed to experimental
conditions which have high information available combined
with a minimum commitment to an opinion position by the subject.


TABLE 4
Summary of t-test Results for Experimental Conditions, All Studies Combined

Read Pretest vs. No Pretest                Priv. Comm. (III Only) vs. Pretest Only
(Studies I, II, III Combined)                t < 1     df = 42
  t = 2.40   df = 127   p < .01

Read Pre (I, II, III Combined)             Priv. Comm. (III) vs. Read Qu. Bodies
vs. Pretest Only                             t < 1     df = 35
  t = 2.86   df = 127   p < .01

Pretest (I, II, III Combined)              Read Qu. Bodies (III) vs. Pretest Only
vs. No Pretest                               t < 1     df =
  t < 1      df = 124

Private Commitment vs. No Pretest          Read Qu. Bodies (III) vs. No Pretest
(Study III Only)                             t < 1     df = 38
  t < 1      df = 38

Private Commitment (III Only)              Read Qu. Bodies (III) vs. Read Pretest
vs. Read Pretest                             t = 1.08  df = 38   p > .45
  t = 1.65   df = 38   .05 < p < .10


Even though all of the directions of change predicted were sup- 
ported, only the read pretest group was significantly different from 
all but one of the other groups. However, the consistency of the 
results in an area of enquiry involving as complex a set of factors 
as found in opinion change phenomena, is even further enhanced 
when one keeps in mind that the subjects yielding such results were 
heterogeneous indeed. Experiments I and II were conducted with 
college students from a small rurally located Eastern university 
and Experiment III on college students from a large urban
university. Also the order of results was consistent over three separate
experiments involving a total of 235 subjects. 

It is, therefore, concluded that some support is evident for the 


Minimum commitment and maximum information provided by
a pretest situation allow for a maximum amount of opinion
change, as the result of a S's exposure to two sequentially pre


ELIMINATION OF THE GUESSING
COMPONENT OF MULTIPLE-CHOICE TEST 
SCORES: EFFECT ON RELIABILITY 
AND VALIDITY* 


ROBERT B. FRARY 
The University of Miami 


Underlying much of the discussion concerning guessing on
multiple-choice tests is the tacit assumption that better measurement
characteristics, particularly improved reliability and validity,
would result if guessing could, in fact, be eliminated. Indeed, it
has been shown (Plumlee, 1952 and 1954) that if guessing is
random whenever an examinee does not know an answer, elimination
of the scores attributable to successful guessing will improve
both reliability and validity. Of course, guessing on multiple-choice
tests is often not random, and Gulliksen (1950) argues
intuitively that guessing following elimination of one or more
choices should not be discouraged.

It is the purpose of this paper to determine the conditions under
which elimination of the guessing component of scores enhances
their reliability or validity. It will be shown that change in
reliability or validity resulting from elimination of the guessing
component of scores depends primarily upon whether well-prepared
or poorly-prepared examinees have higher guessing scores and on
whether the distribution of guessing scores with respect to examinee
ability is stable from one test to another (predictor to criterion
or between parallel forms).


* Portions of this paper are based on the author's doctoral dissertation at
Florida State University.


Variance of Multiple-choice Test Scores


On a multiple choice test, the raw score (X) of an individual may
be expressed as the sum of items known or true score (T), the number
of items guessed correctly (G), and an error term (E).

$$ X = T + G + E. \qquad (1a) $$

In deviation scores the same relationship is

$$ x = t + g + e. \qquad (1b) $$

It is assumed that error scores are uncorrelated with true or
guessing scores and that $\bar{E} = 0$, so that $e = E$.

For deviation scores $x = t + g + e$, it is easily shown that

$$ \sigma_x^2 = \sigma_t^2 + \sigma_g^2 + \sigma_e^2 + 2(r_{tg}\sigma_t\sigma_g + r_{te}\sigma_t\sigma_e + r_{ge}\sigma_g\sigma_e). \qquad (2) $$

Under the assumptions just stated regarding error scores, $r_{te}$ and
$r_{ge}$ are zero so that (2) becomes

$$ \sigma_x^2 = \sigma_t^2 + \sigma_g^2 + \sigma_e^2 + 2r_{tg}\sigma_t\sigma_g. \qquad (3) $$


In some cases $r_{tg}$ may be approximately zero. Swineford (1941)
and others have found that propensity to guess and hence to increase
guessing score is, under some circumstances, unrelated to ability as
measured by the test given. However, in many cases $r_{tg}$ may be
negative due to higher guessing scores for those who know less.
Zimmerman and Williams (1965) have found that under these
circumstances $r_{tg}$ may be expected to range between $-.4$ and 5
for tests with 10 to 100 five-choice items. On the other hand, $r_{tg}$ may
be positive. If those who know more are more adept at utilizing
partial information or clues inadvertently furnished by the item
writer, they may have higher guessing scores than those who guess
randomly or who in their ignorance are more easily intimidated by
admonitions not to guess.

If $r_{tg}$ is positive, it is clear from (3) that elimination of the guessing
component of the scores will reduce $\sigma_x^2$. However, if $r_{tg}$ is negative,
elimination of guessing scores may cause the variance of the total
scores to increase. To see that this result is not only possible but
likely under ordinary circumstances, consider the following
inequality:

$$ \sigma_g^2 + 2r_{tg}\sigma_t\sigma_g < 0, \qquad (4) $$

which is equivalent to

$$ r_{tg} < -\frac{\sigma_g}{2\sigma_t}. \qquad (5) $$

Suppose that $\sigma_g < .5\sigma_t$. Then $r_{tg}$ need be no more negative than
$-.25$ in order to satisfy (5) and consequently (4). Then, if (4) holds,
elimination of guessing would increase total variance [see (3)].
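
The variance argument can be checked numerically. The sketch below is illustrative only: the function names and the standard deviations and correlation are assumptions chosen to satisfy $\sigma_g < .5\sigma_t$, not values taken from the paper.

```python
def total_variance(sd_t, sd_g, sd_e, r_tg):
    """Equation (3): variance of x = t + g + e, errors uncorrelated with t and g."""
    return sd_t**2 + sd_g**2 + sd_e**2 + 2 * r_tg * sd_t * sd_g


def variance_without_guessing(sd_t, sd_e):
    """Variance remaining once the guessing component is removed (sd_g = 0)."""
    return sd_t**2 + sd_e**2


def removal_increases_variance(sd_t, sd_g, r_tg):
    """Inequality (5): r_tg < -sd_g / (2 * sd_t)."""
    return r_tg < -sd_g / (2 * sd_t)


# Illustrative values only: sd_g < .5 * sd_t and r_tg more negative than -.25.
sd_t, sd_g, sd_e, r_tg = 25.0, 10.0, 8.0, -0.5
before = total_variance(sd_t, sd_g, sd_e, r_tg)
after = variance_without_guessing(sd_t, sd_e)
increases = removal_increases_variance(sd_t, sd_g, r_tg)
```

With these assumed values the total variance rises from 539 to 689 when the guessing component is dropped, in agreement with (4) and (5); a milder correlation such as $-.1$ fails (5) and the variance falls instead.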

Reliability 
To show the effects on reliability of eliminating guessing scores,
an expression for parallel forms reliability is required. Here parallel
forms refers to forms such that each examinee has the same true score
on both, total variance is the same for both, and error scores on one
form are uncorrelated with true scores on either form or error scores
on the other (Gulliksen, 1950). In addition it is assumed that error
scores on one form are uncorrelated with guessing scores on the other.
Under these assumptions, the following expression has been derived
by Frary (1969) for the correlation between deviation scores
$x = t + g + e$ on parallel forms a and b:

$$ r_{x_ax_b} = \frac{\sigma_t^2 + 2r_{tg}\sigma_t\sigma_g + r_{g_ag_b}\sigma_g^2}{\sigma_x^2}. \qquad (6) $$

By (3), the denominator of (6) may be replaced so that

$$ r_{x_ax_b} = \frac{\sigma_t^2 + 2r_{tg}\sigma_t\sigma_g + r_{g_ag_b}\sigma_g^2}{\sigma_t^2 + 2r_{tg}\sigma_t\sigma_g + \sigma_g^2 + \sigma_e^2}. \qquad (7) $$

From (7) it may be seen that the effect of eliminating guessing
scores depends upon the size of $r_{g_ag_b}$. Suppose guessing scores are
eliminated. Then $\sigma_g^2 = 0$, and (7) becomes

$$ r_{x_ax_b}' = \frac{\sigma_t^2}{\sigma_t^2 + \sigma_e^2}. \qquad (8) $$

In order to have $r_{x_ax_b}' > r_{x_ax_b}$ it is necessary that

$$ \frac{\sigma_t^2}{\sigma_t^2 + \sigma_e^2} > \frac{\sigma_t^2 + 2r_{tg}\sigma_t\sigma_g + r_{g_ag_b}\sigma_g^2}{\sigma_t^2 + \sigma_e^2 + 2r_{tg}\sigma_t\sigma_g + \sigma_g^2}, \qquad (9) $$

which is equivalent to

$$ r_{g_ag_b} < \frac{\sigma_t^2\sigma_g - 2r_{tg}\sigma_t\sigma_e^2}{\sigma_g(\sigma_t^2 + \sigma_e^2)}. \qquad (10) $$
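
The step from (9) to (10) can be sketched as follows. This is a reconstruction from (7) and (8) under the stated assumptions, with $\sigma_g > 0$, and is not a quotation of the original derivation.

```latex
% Both denominators in (9) are variances and hence positive,
% so cross-multiplying preserves the direction of the inequality.
\begin{align*}
\sigma_t^2\left(\sigma_t^2+\sigma_e^2+2r_{tg}\sigma_t\sigma_g+\sigma_g^2\right)
  &> \left(\sigma_t^2+\sigma_e^2\right)
     \left(\sigma_t^2+2r_{tg}\sigma_t\sigma_g+r_{g_ag_b}\sigma_g^2\right)\\
% Cancel \sigma_t^2(\sigma_t^2+\sigma_e^2) and 2r_{tg}\sigma_t^3\sigma_g from both sides.
\sigma_t^2\sigma_g^2-2r_{tg}\sigma_t\sigma_g\sigma_e^2
  &> r_{g_ag_b}\,\sigma_g^2\left(\sigma_t^2+\sigma_e^2\right)\\
% Divide by \sigma_g^2(\sigma_t^2+\sigma_e^2) > 0.
r_{g_ag_b}
  &< \frac{\sigma_t^2\sigma_g-2r_{tg}\sigma_t\sigma_e^2}
          {\sigma_g\left(\sigma_t^2+\sigma_e^2\right)}
\end{align*}
```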


$r_{tg}$ Positive

Suppose that $r_{tg} > 0$. Then the larger a positive value of $r_{tg}$, the
smaller $r_{g_ag_b}$ must be in order to have (10) and hence (9) hold true,
i.e., to have reliability of scores corrected for guessing larger than
reliability of scores not corrected for guessing. If there is a strong
positive relationship between guessing score and true score ($r_{tg}$ large)
it is reasonable to suppose that $r_{g_ag_b}$ may be large also. This
relationship is the opposite of that required for improvement of alternate
forms reliability by elimination of guessing variance. Thus
$r_{x_ax_b}' < r_{x_ax_b}$ may be expected in this case, especially if $r_{tg}$ is large.

However, if error variance is sufficiently small, the right side of
(10) will be approximately unity, since

$$ \lim_{\sigma_e^2 \to 0} \frac{\sigma_t^2\sigma_g - 2r_{tg}\sigma_t\sigma_e^2}{\sigma_g(\sigma_t^2 + \sigma_e^2)} = 1. \qquad (11) $$

Then any value of $r_{g_ag_b}$ will satisfy (10) with the result that
reliability is enhanced by reducing guessing variance in this case.
On the other hand

$$ \lim_{\sigma_g \to 0} \frac{\sigma_t^2\sigma_g - 2r_{tg}\sigma_t\sigma_e^2}{\sigma_g(\sigma_t^2 + \sigma_e^2)} = -\infty. \qquad (12) $$

Thus elimination of a very small guessing variance would be most
unlikely to improve alternate forms reliability if $r_{tg}$ is significantly
positive.


$r_{tg}$ Negative


In this case (10) holds, but with $r_{tg} < 0$ the right side of (10)
increases as $r_{tg}$ becomes more negative. In fact, the right side of (10)
will be greater than unity if $r_{tg} < -1/4$ when $\sigma_g < \sigma_t/2$. Further,
if $r_{tg} < 0$ is due to random guessing whenever the correct choice is
unknown, a positive $r_{g_ag_b}$ of only moderate size may be expected.
Zimmerman and Williams (1965) found this correlation to range
between .17 and .65 for tests consisting of 10 to 100 five-choice items,
and on which all guessing was random. Thus, when $r_{tg} < 0$, it will
frequently be the case that (10) will be satisfied and elimination of
guessing variance will increase reliability.

If either $\sigma_g^2$ or $\sigma_e^2$ is sufficiently small, elimination of guessing
variance will increase reliability regardless of the size of $r_{g_ag_b}$ when
$r_{tg} < 0$. This result follows because (11) holds for all values of $r_{tg}$
and because in place of (12) it is the case that

$$ \lim_{\sigma_g \to 0} \frac{\sigma_t^2\sigma_g - 2r_{tg}\sigma_t\sigma_e^2}{\sigma_g(\sigma_t^2 + \sigma_e^2)} = +\infty. \qquad (13) $$


$r_{tg}$ of Zero


In this case (10) becomes

$$ r_{g_ag_b} < \frac{\sigma_t^2}{\sigma_t^2 + \sigma_e^2}. \qquad (14) $$

Thus the ease with which (14) and consequently $r_{x_ax_b}' > r_{x_ax_b}$ is
satisfied depends only on the size of $\sigma_e^2$. If error variance is large
with respect to true variance, $r_{g_ag_b}$ will have to be relatively small
in order for elimination of guessing scores to enhance reliability.


Summary 


The conclusions reached above may be summarized as follows.
If the better prepared examinees, the examinees who guess
intelligently on the basis of partial information, attain the highest guessing
scores, $r_{tg} > 0$ may be expected. Then elimination of guessing scores
could not be expected to increase reliability. When examinees guess
randomly whenever they do not know an answer or whenever less
well prepared examinees have higher guessing scores, $r_{tg} < 0$ may
be expected. In this case eliminating guessing scores may be expected
to increase reliability. Similarly, if there is reason to believe guessing
scores unrelated to true scores ($r_{tg} = 0$), elimination of guessing
scores should enhance reliability, provided error variance is not
relatively large and that $r_{g_ag_b}$ is not near unity.
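
These conclusions can be illustrated numerically. The following Python sketch evaluates the parallel-forms reliability with and without the guessing component and the bound on the correlation between guessing scores; the function names and all numerical values are hypothetical, chosen only to show one favorable and one unfavorable case.

```python
def reliability_with_guessing(sd_t, sd_g, sd_e, r_tg, r_gg):
    """Equation (7): parallel-forms reliability of x = t + g + e."""
    shared = sd_t**2 + 2 * r_tg * sd_t * sd_g
    return (shared + r_gg * sd_g**2) / (shared + sd_g**2 + sd_e**2)


def reliability_without_guessing(sd_t, sd_e):
    """Equation (8): reliability after the guessing component is eliminated."""
    return sd_t**2 / (sd_t**2 + sd_e**2)


def r_gg_upper_bound(sd_t, sd_g, sd_e, r_tg):
    """Right side of (10): r_gg must fall below this for reliability to improve."""
    return (sd_t**2 * sd_g - 2 * r_tg * sd_t * sd_e**2) / (sd_g * (sd_t**2 + sd_e**2))


# Hypothetical cases: (sd_t, sd_g, sd_e, r_tg, r_gg).
cases = [
    (25.0, 10.0, 8.0, -0.5, 0.3),  # random-guessing pattern, moderate r_gg
    (25.0, 10.0, 8.0, 0.5, 0.9),   # able examinees guess best, high r_gg
]
improves = []
for sd_t, sd_g, sd_e, r_tg, r_gg in cases:
    with_g = reliability_with_guessing(sd_t, sd_g, sd_e, r_tg, r_gg)
    without_g = reliability_without_guessing(sd_t, sd_e)
    improves.append(without_g > with_g)
```

In the first assumed case the bound from (10) exceeds unity, so eliminating guessing raises reliability; in the second the bound is well below the assumed correlation between guessing scores, and reliability falls.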


Validity 
The criterion scores may or may not have a guessing component.
For a more general development, let a criterion score be the sum of
its true, guessing and error components, or, in terms of deviation
scores,

$$ y = t_y + g_y + e_y. \qquad (15) $$

Then, modifying (1b), the deviation scores to be correlated with the
criterion scores may be expressed as

$$ x = t_x + g_x + e_x. \qquad (16) $$

By definition

$$ r_{xy} = \frac{\sum xy}{N\sigma_x\sigma_y} = \frac{\sum (t_x + g_x + e_x)(t_y + g_y + e_y)}{N\sigma_x\sigma_y} \qquad (17) $$

$$ = \frac{\sum t_xt_y + \sum t_xg_y + \sum t_yg_x + \sum g_xg_y + \sum (t_xe_y + g_xe_y + e_xt_y + e_xg_y + e_xe_y)}{N\sigma_x\sigma_y}. $$

Since error is assumed to be random, the last term in the numerator
is zero. Then

$$ r_{xy} = \frac{r_{t_xt_y}\sigma_{t_x}\sigma_{t_y} + r_{t_xg_y}\sigma_{t_x}\sigma_{g_y} + r_{t_yg_x}\sigma_{t_y}\sigma_{g_x} + r_{g_xg_y}\sigma_{g_x}\sigma_{g_y}}{\sigma_x\sigma_y}. \qquad (18) $$

If guessing scores are eliminated for the predictor test ($g_x$ scores),

$$ r_{xy}' = \frac{r_{t_xt_y}\sigma_{t_x}\sigma_{t_y} + r_{t_xg_y}\sigma_{t_x}\sigma_{g_y}}{(\sigma_{t_x}^2 + \sigma_{e_x}^2)^{1/2}\,\sigma_y}. \qquad (19) $$

Here the denominator term $(\sigma_{t_x}^2 + \sigma_{e_x}^2)^{1/2}$ is the standard deviation
of the predictor scores after elimination of their guessing component
[see (3)]. To simplify notation, let

$$ \sigma_x' = (\sigma_{t_x}^2 + \sigma_{e_x}^2)^{1/2}. \qquad (20) $$

As in the case of reliability, the degree of correlation between
guessing scores on the two tests is a sensitive indicator of whether
elimination of predictor-test guessing scores will increase or decrease
correlation between total scores. This conclusion is plausible, since
elimination of any strongly correlated score components would tend
to weaken the total relationship. Thus a good approach to
determining more precisely the effect of elimination of guessing scores is
to consider the change in validity resulting therefrom as a function
of $r_{g_xg_y}$. To this end, let $v = r_{xy}' - r_{xy}$ so that from (18), (19), and (20)

$$ v = \frac{(r_{t_xt_y}\sigma_{t_x}\sigma_{t_y} + r_{t_xg_y}\sigma_{t_x}\sigma_{g_y})(\sigma_x - \sigma_x')}{\sigma_x'\sigma_x\sigma_y} - \frac{r_{t_yg_x}\sigma_{t_y}\sigma_{g_x}}{\sigma_x\sigma_y} - \frac{r_{g_xg_y}\sigma_{g_x}\sigma_{g_y}}{\sigma_x\sigma_y}. \qquad (21) $$

The function of $r_{g_xg_y}$ defined by (21) is linear, and examination of
the coefficient of $r_{g_xg_y}$ shows that the slope is always negative. Thus
the larger $r_{g_xg_y}$ is, the less favorable will be the change in validity
resulting from elimination of guessing scores. However, if the slope
is only slightly negative and the v intercept is somewhat positive,
negative changes in validity may be avoided even for fairly
substantial positive values of $r_{g_xg_y}$. This situation is illustrated in Figure 1
where the function defined by (21) is graphed under two sets of
values resulting in different slopes and intercepts.
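
The linearity and negative slope of the validity change can be checked with a short Python sketch. It computes the change directly as the difference between the validity after and before removing predictor guessing scores; the function names and every parameter value are assumptions for illustration and are not the hypothetical sets graphed in the figure.

```python
from math import sqrt


def sd_total(sd_t, sd_g, sd_e, r_tg):
    """Standard deviation of total scores, from equation (3)."""
    return sqrt(sd_t**2 + sd_g**2 + sd_e**2 + 2 * r_tg * sd_t * sd_g)


def validity_change(r_gxgy, *, sd_tx, sd_gx, sd_ex, r_txgx,
                    sd_ty, sd_gy, sd_ey, r_tygy, r_txty, r_txgy, r_tygx):
    """v = r'_xy - r_xy, i.e. equation (19) minus equation (18)."""
    sd_x = sd_total(sd_tx, sd_gx, sd_ex, r_txgx)
    sd_y = sd_total(sd_ty, sd_gy, sd_ey, r_tygy)
    sd_x_prime = sqrt(sd_tx**2 + sd_ex**2)  # equation (20)
    kept = r_txty * sd_tx * sd_ty + r_txgy * sd_tx * sd_gy
    r_xy = (kept + r_tygx * sd_ty * sd_gx + r_gxgy * sd_gx * sd_gy) / (sd_x * sd_y)
    r_xy_prime = kept / (sd_x_prime * sd_y)
    return r_xy_prime - r_xy


# Illustrative parameter set only (not taken from the paper).
params = dict(sd_tx=25.0, sd_gx=10.0, sd_ex=8.0, r_txgx=-0.5,
              sd_ty=25.0, sd_gy=10.0, sd_ey=8.0, r_tygy=-0.5,
              r_txty=0.5, r_txgy=-0.25, r_tygx=-0.25)
v_intercept = validity_change(0.0, **params)
slope = validity_change(1.0, **params) - v_intercept
```

For these assumed values the intercept is positive and the slope equals $-\sigma_{g_x}\sigma_{g_y}/(\sigma_x\sigma_y)$, so validity improves only while the correlation between guessing scores stays below the point where the line crosses zero.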

It is easy to see that the slope of the function defined by Q1) vil 
be most nearly horizontal when either c,, or cp, is near zero. ^' 
course c,, will be of substantial size in any case considered here; 
otherwise there would be no interest in eliminating it. Howeve? 
criterion scores may not have a guessing component. In this cas 
sa; = 0, and the function defined by (21) is constant with re 
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v(change in rz) 


Toros ——— 
Values 
Variable A B 
91, 25 25 
Tos 15 10 
Tuo SN -5 
Ce," 15 8 
Os 22.4 23.2 
vt 29.2 26.2 
oy 50 25 
Soy 20 10 
Tio," T 3 
Ca’ 15 8 
ey 41.5 30.6 
Tisoy — .63 1b 
Tio, — .63 =125 
Tut, 9 5 


‘Required to compute ox, øs’ or øy but not shown in formula 21. 


une l. Validity change resulting from elimination of g+ scores as & funo- 
n of foa, for two hypothetical sets of values for formula 21. 


‘or,,.,. Then whether eliminating g- scores increases validity depends 
only on the value of the v intercept. Thus, regardless of whether the 
ation defined by (21) is horizontal or has negative slope, a key 
"terminer of potential validity improvement is the v-intercept value. 
Tom (21) the v-intercept value is 


Penty, F 76,:,,0:0,0,,) (C2 — 2.) — Tyee Fy? a! , (22) 
0, ys 
Which is equivalent to 


‘eas » 
Tz Oa! LERNTE UA E Piroti — Leales ps i (23) 
Oz 0,0, 0,0, 
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The variation of (23) may be analyzed conveniently in ten 
two of its variables, r,,,. and r,,,,. For any given situation, the va 
of Tipoa and 7;,,, should be respectively similar to but slightly les 
absolute value than rs.. and Tio allowing for a less than pe 
positive relationship between the two sets of scores. Then, be 
Tree and Tipo, depend on instructions regarding guessing and 
quent guessing behavior, it will be possible eventually to charae 
according to guessing behavior on the predictor and criterion 
the circumstances under which elimination of guessing variance 
enhance validity. 


Case 1: Tepos < 0 and tisoy < 0 


This Case could result from random guessing on both tests ¥ 
ever an answer is not known. Tso, < 0, Tisoy, < 0 and c, — Oai 
are expected, the latter in view of comments made earlier regati 
the change in total variance upon elimination of guessing 8 
(see Variance of Multiple-Choice Test Scores). The fraction ( 
a,')/c,’ tends to be small in absolute value, and r,,,, < 0 fw 
contributes to a small absolute value for the first term of (28); 

The subtraction of a substantially negative second term in 
tends to result in a positive value for the v intercept. This situ 
is illustrated in Table 1, which gives the v-intercept values correspt 
ing to several sets of hypothetical values for the variables inva 
It is interesting to note that higher v-intercept values result fro : 
larger absolute values of r,,,, and fi», associated with the | 
values of r,,,,. Also, it may be noted that when the guessing variance
is smaller, as for the last 12 lines of Table 1, the v-intercept values
are lower.

The prospects for improving validity by eliminating guessing scores
on the predictor test are not good in this case. The higher the v
intercept of the function defined by (21), the more negative its slope.
Further, r,.,, is likely to be somewhat positive. The work of Zimmer-
man and Williams (1965) indicates that r.o, in the range of 2.8
may be expected, with higher values associated with longer tests.
Thus, in this case the best that may be hoped for is a very small
increase in validity, and a decrease is a distinct possibility, especially
for longer tests.




Case 2: r,,, < 0 and Tino, > 0
If guessing is random whenever an answer is not known on the pre-
dictor test and the best prepared examinees are the most successful
guessers on the criterion test, this Case would result. r,,,, < 0, 
Tisoy > 0 and c, — a,’ < 0 are also expected. Then in (23) (v, - 
c,')/c, would still tend to be small, but the value of 1%,1,0:.01, + 
Tt20,7t,7, Would be much larger than for Case 1. Thus the first term 
of (23) would be more substantially negative with the result that 
v-intercept values would be lower than for Case 1. 

In this Case 7,,,, is likely to be negative, for high scores on the 
criterion test would have the best guessing scores along with the 
reverse situation on the predictor test. Thus the v-intercept values 
near zero are not unfavorable in view of the negative slope. The 
more pronounced the negative relationship between guessing scores 
on the two tests and the larger the guessing variance involved, the
greater an increase in validity may be expected if the guessing com-
ponent can be removed from the predictor scores.


Case 3: r,,,, > 0 and Tinos > 0


Resulting from better prepared examinees having greater guessing 
success on both tests, this Case would coincide with ri, > 9% Tu ? 
0, and c, — c,' > 0. Then both terms of (23) would be positive.
While the second term of (23) has an absolute value of the same
magnitude as in Case 2, that of the first term is larger. This result
follows because the absolute value of c, — o,’ is larger when fin is
positive than when it is negative [see (3)]. In consequence, v-intercept
values lower than for Case 2 may be expected.

Unlike Case 2, this Case offers little hope for increasing validity.
As in Case 1, r,s, > 0 may be expected in view of similar guessing
behavior on both tests. Then, with the always negative slope of (23)
and v intercept lower than for Case 1, a negligible positive or, most
likely, negative change in validity may be expected upon elimination
of g. scores.


Case 4: ru, > O and Tisoy < 0 


If best guessing scores are made by the better prepared examinees
on the predictor test and guessing is random on the criterion test,
this Case would result. r,,,, > 0, r,,,, < 0 and c, — c, < 0 may be ex-
pected. The absolute value of the first term of (23) would be a bit
smaller than for Case 3 due to fe, < 0. Thus, subtraction of a
substantially positive second term results in v-intercept values even
smaller than for Case 3.

Even though v-intercept values may be near or below zero in this 
Case, the outlook for improving validity by eliminating g, scores is 
favorable. As in Case 2, a negative r,,,, may be expected, so that the 
negative slope of the line determined by (23) may result in a positive 
v value for a moderately negative value of r,,,,. However, the validity 
improvement would not be as great as for Case 2 due to the smaller 
v intercept. 


Case 5: fayo, < 0 and Tu, = 0 


While r,,,, < 0 could be the result of random guessing on the
predictor test, r,,,, = 0 could represent either of two criterion test
situations. One possibility is that the instructions regarding guessing
may be confusing, nonexistent, or effective for only part of the exami-
nees. Then, while guessing occurs, the guessing scores are unrelated to
the true scores. The other possibility is that there is no guessing
component (or a nonzero, constant guessing component) for the y
scores. In either situation, (23) reduces to


+ 
Oz — Oz Tuy fut, (24) 
7 b typa’ 
0. 0,0, vids A 


The values taken on by (24) in this Case tend to be between those
of Cases 1 and 2 above. This statement follows because the value
of 7«,7,,7,, lies between the values of Trey tety F teneo in
Cases 1 and 2. While values of c, will differ from those of Cases 1 and
2 (assuming Tisos = 0), the changes are relatively small and affect
both terms of (24) in the same manner. Table 2 gives representative
values for this Case including examples in which cs, > 0 and in
which o,, = 0. Note that the sets of values used in Table 2 are
otherwise essentially the same as those for Table 1.

In this Case, the intercept value alone determines the change in
validity resulting from elimination of g, scores. Since these values
tend to be positive, prospects are good for improving validity, espe-
cially if the absolute value of r,,,, is large. Even though this effect
would tend to be countered by a simultaneously large ri, the
effect of Tt, is not as strong as that of r,,,, due to the small absolute
value of (o, .— c.")|o;' [see (24)].


Case 6: 0, > 0 and Tun = 0


TABLE 2—CASE 5
iyos < 0 and Troy, = 0—Hypothelical Values for the Variables of Formula 24 and 
Values for the Expression 
Tt, Te 
and and 


ry, — vinterp 


—.09 3 08 

—.8 20.2 2.2 30.2 —.15 5 i 

—.24 8 Pug 

2B]. 00000 OO ee 
—.15 3 W 

—.5b 233.2 26.2 28.1 —.25 5 D 

=.40 08 E 
dL 
—.09 "8 E. 

—.8 25.3 20.2 28.1 —.15 5 P 

—.24 8 D 

25 8 IX MEER eee 
EE g x 

—.5 23.2 2.2 26.2  —.25 : ‘ 

EF eee D) 

15. 0 — r 3 w 
i 2 0 

15 3 m 

—.5 23.2 26.2 26.2  —.25 i l 

—.40 8 4m 

—.09 4 w 

-.8 3 À 2) -.15 . ; 


This Case is similar to Case 5 except that guessing scores on the
predictor test are positively related to true score on the criterion
test. Values for the v intercept tend to lie between those of Cases 3
and 4, just as those for Case 5 lie between those of Cases 1 and 2.
Thus, the v-intercept values which may be obtained in this Case
correspond to smaller increases in validity upon elimination of g,
scores than for Case 5.


Case 7: Tiso = 0 and rie > 0


Here only confusing, nonexistent, or incompletely communicated


TABLE 3—CASE 7


Tho = 0 and Tisoy > 0—Hypothetical Values for the Variables of Formula 25 and
Values for the Expression


Tu Te Tos 
and and and 
oy Sey Oo iyoy Oe o Os Oy Tuo, s intercept 
.15 3 .034 
-5 30.2 26.2 35.9 .25  .5 .057 
.40 8 .092 
15 
.09 3 .033 
-3 30.2. 26.2 33.7. .15 5 055 
.24  .8 .088 
15 3 .018 
.5 28.1 26.2 32.2 .26 5 .030 
40 .8 .048 
10 
.09  .3 .018 
.8 28.1 20.2 30.6  .15 5 .030 
.24  .8 .047 


instructions would account for r,,,, = 0. The possibility that oo. = 0 
is not of interest as pointed out earlier. fs., > 0 would result from 
higher guessing scores on the criterion test for better prepared exam- 
inees. In this case, (23) reduces to 


a= On! Ti 0100, F Tis Tuo, (25) 
[^ 9.0y 

Here it would be only reasonable to assume that tr... = 0 so that
— c, > 0 would definitely be the case. Then a positive v intercept
would necessarily follow as well. Table 3 shows that for representative
values of the variables in (25) the increase in validity upon elimination
of g, scores ranges from inconsequential to substantial values as rir,
and r,,,, increase.


Case 8: Tisos = 0 and Tis, <0 


This Case is similar to Case 7 except that higher guessing scores for the
more poorly prepared examinees would be the case for the criterion
test. Then the term r,,,,0:,0,, F 71,570,005 in (25) would tend to have
a smaller absolute value, so that the v-intercept value would be nearer
zero. However, this tendency is offset somewhat by smaller values for
% than occur in Case 7.


Case 9: Tisoy = 0 and Ti = 0


Here, (23) reduces to


Te — Fs! Test ui. (26) 


Oz 0,0, 
Values between those of Cases 7 and 8 tend to result. However, if 
Ss, = 0, the resulting smaller value of c, in the denominator may 
result in values larger than for Case 7. Therefore validity improve- 
ment resulting from elimination of g, scores could then be substantial, 
especially if r;,,, is large. 


Effects of Variables Not Considered in Cases 1-9 


In order to avoid complicating an already involved situation
further, changes in error-score variance and true-score variance were
not considered in the above analyses. However, different sizes of true-
score variance tend to have little effect on validity change. For
example, if instead of o,, = o,, = 25, it is the case that c, = 25 and
7,, = 2, validity changes follow the same pattern as presented above.
This result follows because the size of true-score variance dictates
within reasonable limitations the size of the other variance com-
ponents for the scores in question. Further, as seen in (23) the true-
score variances are reflected in both numerator and denominator of
both terms, so that changes tend to cancel each other.

With respect to error variance, it may be noted in (23) that error
variance affects mainly the denominators of the two terms. Thus
increases in error variance tend to reduce the size of both terms with
the result that the absolute values of validity changes are reduced
when error variance increases.


Summary 
Validity improvement upon elimination of g. scores would be most
substantial in Cases 2, 4, 5, and 7. These Cases all require that there
be a substantial difference between the predictor and criterion tests.
Either random guessing is encouraged on one test and restricted on
another or guessing scores are absent or uncorrelated with true scores
on one of the tests.


Discussion


In the foregoing development, the question of how to eliminate or
reduce the guessing component of scores has not been pe
While a great deal has been written on the subject (see De
1965), no very satisfactory method has been found within the con-
ventional, multiple-choice test format. Of course, some effect may be
obtained by instituting a penalty for wrong answers, but such mea- 
sures penalize simple misinformation and tend to result in measuring 
personality characteristics of the examinees instead of how much 
they know. 

Acknowledging then that any marked degree of control over the
guessing component of conventional, multiple-choice test scores is
not currently feasible, what is to be gained by having such control?
The answer to this question depends on the test and examinees. For
many situations, the results obtained above show that eliminating the
guessing component of scores would increase reliability only at the
greater expense of reducing validity. Consider, for example, the testing
situations in which there is a wide range of examinee ability and only
completely random guessing is discouraged. For such a situation
^i; will almost surely be negative and probably substantially so simply
because the weaker examinees have so many more opportunities to
guess. Then elimination of guessing scores would tend to increase
reliability. However, if the criterion testing situation is similar to
that for the predictor, Case 1 applies, and validity may well decrease.
Case 2 is extremely implausible for such a group of examinees. Only
if the criterion scores have no guessing component would both reli-
ability and validity be expected to increase (Case 5). However, scores
with no guessing component may actually be quite rare. Even the
common criterion of grade-point average may have a guessing com-
ponent if course grades are determined from multiple-choice test
scores.

It should be noted that, throughout all the preceding discussion,
reliability has referred to the correlation between scores on parallel
forms of a test. Thus the results obtained are not comparable with
that to the effect that making a test more consistent internally tends
to reduce its validity. While increases in correlation between forms
of a test tend to follow increases in internal consistency, the reverse
is not necessarily the case.

The derivations presented earlier regarding validity also emphasize
the importance of knowledge regarding both predictor and criterion
test parameters before any attempt is made to eliminate guessing
scores. Indeed, if r,, > 0, it could be the case that both reliability and
validity would be reduced by eliminating guessing scores (Case 3).
While it is definitely possible to increase both reliability and validity
by eliminating guessing scores (Cases 2, 5, 7, 8, 9), the circumstances
required for such a change may be difficult to obtain in practice.


STABILITY OF APTITUDE SCORES FOR ADULTS


WILLIAM K. SHOWLER AND ROBERT C. DROEGE
U. S. Employment Service


Aptitude tests are intended to provide information concerning
relatively lasting characteristics of the person tested. It follows
that stability of measurement¹ is an important requirement in an
aptitude test. To what extent does aptitude test stability deterio-
rate over time for various types of aptitudes? Two large-scale,
systematic studies have recently been completed by the U.S. Em-
ployment Service to provide such information for the General
Aptitude Test Battery (GATB). The aptitudes measured by the
GATB are General Learning Ability (G), Verbal Aptitude (V), 
Numerical Aptitude (N), Spatial Aptitude (S), Form Perception 
(P), Clerical Perception (Q), Motor Coordination (K), Finger 
Dexterity (F), and Manual Dexterity (M). Each of the Aptitudes 
has a mean of 100 and standard deviation of 20 for the GATB 
General Working Population Sample (U. S. Department of Labor, 
1967). The first (long-range) study was conducted in 1959-62 to 
determine the stability of the GATB aptitude scores when the in- 
terval between initial testing and retesting is 1, 2, and 3 years, 
respectively. The second (short-range) study was conducted in 
1965-66 to determine the stability of the GATB when the interval 
between testings is less than one year. 


Method 
Long-Range Study 


Eighteen State Employment Services participated in the study.
The sample consisted of individuals between 25 and 34 years of
age at the time of initial testing. The age range 25-34 was chosen
because it represents the interval during which the effects of mat-
uration and aging upon GATB aptitude scores appear to be
minimal (U. S. Department of Labor, 1967). Most of the in-
dividuals in the sample were employees in local or State Employ-
ment Service offices. No person who had taken the GATB or was
familiar with it was included in the sample. At each testing loca-
tion, those initially tested were divided into Subsamples A, B, and
C at the time of testing. The three subsamples were tested initially
with the GATB, B-1002B during the same one-month period be-
fore June 30, 1959, and then retested with an alternate form of
the GATB (B-1002A) after intervals of 1, 2, and 3 years, respec-
tively. Of the 1,309 initially tested, 896 were available for retest
and were included in the final sample. Table 1 shows data on age,
education, and sex for the three final subsamples. The subsamples
are quite comparable with regard to these basic characteristics.

¹ The term "stability of measurement" will be used to refer to the relation-
ship between initial test scores and retest scores for a specified group of indi-
viduals.


TABLE 1
Age, Education, and Sex Characteristics of Subsamples in Long-Range Study


                    Age            Education      Per cent
Subsample         M      SD       M      SD       Male
A (N = 302)      20.8    3.0     13.8    1.9       39
B (N = 288)      29.7    3.0     13.3    2.0
C (N = 306)      30.4    2.8     13.0    1.8


Short-Range Study


Sixteen State Employment Services participated in the study.
The sample consisted of individuals between 16 and 69 years of
age at the time of initial testing. No age restrictions were placed
upon selection of the sample because the possible effects of mat-
uration and aging would be minimal for intervals of less than one
year between initial testing and retesting. Most of the individuals
in the sample were State employees, Employment Service appli-
cants, or inmates of penal institutions. No individual in the sample
was receiving any academic or other kind of training during the
period covered by the study. No person who had taken the GATB
or was familiar with it was included in the sample. At each testing
location, those initially tested were divided into Subsamples A, B,
C, D, and E at the time of testing. The five subsamples were
tested initially with the GATB, B-1002A during the period, May
1 to July 31, 1965, then retested with an alternate form of the
GATB (B-1002B) after intervals of 1 day, and 2, 6, 13, and 26
weeks, respectively. Of the 2,303 initially tested, 1,705 were avail-
able for retest and were included in the final sample. Table 2 shows
data on age, education, and sex for each of the five final subsamples.
Again, the subsamples are quite comparable with respect to these
basic characteristics.


Results and Discussion 


Table 3 shows the means and standard deviations of GATB
aptitude scores at initial testing and retesting for the eight sub-
samples in the Short- and Long-Range Studies. Also shown are the
aptitude stability coefficients (product-moment correlations be-
tween initial aptitude test and retest scores) for each of the sub-
samples. Tests of differences between correlated means show that
nearly all of the mean score increases between initial testing and
retesting represented in Table 3 are statistically significant (.05
level). The relationship between size of mean score increase and
time interval between initial test and retest has been analyzed for
each aptitude (Showler and Droege, 1967).
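
The two statistics named in this paragraph, the stability coefficient and the test of a difference between correlated means, can be illustrated with a minimal sketch. The scores below are hypothetical and are not GATB data; the function names are chosen for illustration only, and the usual paired t statistic is assumed for the test of correlated means.

```python
import math

def pearson_r(initial, retest):
    """Product-moment correlation between initial and retest scores."""
    n = len(initial)
    mean_i = sum(initial) / n
    mean_r = sum(retest) / n
    sxy = sum((a - mean_i) * (b - mean_r) for a, b in zip(initial, retest))
    sxx = sum((a - mean_i) ** 2 for a in initial)
    syy = sum((b - mean_r) ** 2 for b in retest)
    return sxy / math.sqrt(sxx * syy)

def paired_t(initial, retest):
    """t statistic for the mean retest gain (correlated means), df = n - 1."""
    gains = [b - a for a, b in zip(initial, retest)]
    n = len(gains)
    mean_gain = sum(gains) / n
    sd_gain = math.sqrt(sum((g - mean_gain) ** 2 for g in gains) / (n - 1))
    return mean_gain / (sd_gain / math.sqrt(n))

# Hypothetical aptitude scores for six examinees (illustration only).
initial = [98, 105, 112, 87, 120, 101]
retest = [103, 108, 118, 90, 121, 107]

print("stability coefficient:", round(pearson_r(initial, retest), 3))
print("paired t:", round(paired_t(initial, retest), 3))
```

A high coefficient together with a significant t is the pattern the text describes: scores rise on retest, yet examinees keep nearly the same relative standing.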

The stability coefficients in Table 3 are in the range of .80 to .90
for most aptitudes. The coefficients for Aptitudes F and M are 
generally lower than those for the other aptitudes. It will be noted 
that the nine aptitude stability coefficients for Subsample A of the 
Short-Range Study (retested after one day) are consistently higher 
than those for the Subsample C of the Long-Range Study (retested 


TABLE 2 
Age, Education, and Sex Characteristics of Subsamples in Short-Range Study 


                    Age            Education      Per cent
Subsample         M      SD       M      SD       Male
A 3 
RINE 1) 319 92 a2 2.2 E 
OQ E con 321 91 $3 22 57 
Daves cone 324 8.8 134 2.2 
meee) 2 1 2.4 54 
EWs 32.8 8.9 E 
e 499) 31.8 8.6 12.5 24


TABLE 3


Aptitude Means (M), Standard Deviations (SD), and Stability Coefficients (r_IR) for
Subsamples in the Short- and Long-Range Studies
(See Tables 1 and 2 for Number of Cases)


Subsample 


Aptitudes of the GATB 


(Interval) G 

M Initial 103.1 104.0 102.1 99.4 102.9 112.4 109.2 983 10 
A Retest 107.6 107.0 107.1 106.4 113.0 122.6 118.8 112.8 I 
(1day) SD Initial 19.0 18.0 19.4 19.5 19.2 18.5 21.7 227 9 
Retest 19.9 18.9 22.1 19.6 23.4 21.1 23.2 228 M 

TIR .98  .87 93.1..85 e85  .85  .91 9 
M Initial 103.2 104.2 102.1 100.2 104.4 113.0 105.8 95.7 a 
B Retest 105.8 106.7 104.9 106.2 112.0 122.6 115.0 107.2 1 
(2wk.) SD Initial 17.3 17.8 18.0 18.7 19.2 17.9 22.1 214 Ht 
Retest 19.5 18.6 20.2 19.4 23.3 22.0 23.2 23.4 M 
TIR aa 489.92  .82  .85 81 90. «594 
M Initial 104.3 104.8 103.1 101.0 104.8 113.5 106.6 96.0 : 
c Retest 104.7 105.4 104.2 105.2 110.4 122.1 115.2 100.8 10 
(6 wk.) SD Initial 18.1 17.3 19.8 19.8 20.5 19.9 20.7 719 ff 
Retest 20.9 18.8 22.5 19.2 22.7 22.7 22.6 28 ^| 
Tir Oea  .84  .82  .86  .88 «0! 
l 
M Initial 103.8 104.2 102.6 100.7 105.2 113.8 107.4 97.0 W 
D Retest 106.5 106.6 105.7 106.4 111.6 122.1 113.9 106.6 1 
(3 wk.) SD Initial 17.8 17.5 19.3 19.7 18.8 17.9 21.0 212 5 
Retest 19.7 184 220 199 21.5 213 221 BS % 

TIR COR esq | B0 ^ .84' ` OLTE 
M Initial 104.6 105.2 104.2 101.5 104.6 114.5 106.7 2 im 
E Retest 106.3 106.1 104.9 105.1 110.8 121.5 112.5 ne n 
(@6wk.) SD Initial 19.6 18.9 19.0 20.9 21.0 19.0 221 714 g 
Retest 21.1 19.2 224 20.3 221 203 23.6 753] 
tre a E dace asit 83°) | 70: i: E. k 
1 h 
M Initial 110.2 110.8 109.4 104.8 109.1 119.1 116.8 1054 i 
Retest 114.0 114.1 112.8 109.1 111.5 125.1 124.1 1003 E 
(lyr) SD Initial 17.7 17.6 162 19.0 18.8 16.0 178 213 g 
Retest 16.1 16.0 14.9 21.3 17.5 16.2 17.5 g ] 
Tir E SU 7 .74 85 E 4 
tye j | 
M ‘Initial 109.2 109.2 110.0 103.0 109.3 119.1 115.7 es n 
B Retest 112.8 112.2 113.3 107.0 112.2 124.0 122.0 105 gi 
Gyr) SD Initial 16.9 16.9 163 19.4 17.5 16.7 16.8 25 f 
Retest — 16.8 16.8 14.7 22.1 16.8 17.3 18.0 Mg | 
fin Hé. sn cst cue à 07 0 ME 
M Initial 106.4 106.8 106.2 102.2 107.1 116.4 114.9 B 109 
c Retest 110.7 111.6 109.3 100.0 110.4 120.0 119.2 1009 % 
Gyr) SD Initial — 189 15.9 17.3 19.6 16.4 15.4 18-4 Sjo 7 
Retest 17.5 16.8 15.5 21.0 15.3 183 “73 | 


Tir 


8 


.85 


.88 


.84


TABLE 4
Analysis of Variance for Short- and Long-Range Studies to Test Homogeneity of Regression

                               Short-Range Study                  Long-Range Study
                          MS           MS                     MS           MS
                          Error        Interval               Error        Interval
Aptitude                  (df = 1695)  (df = 4)      F        (df = 890)   (df = 2)      F

G  Intelligence             63.25       30.50       .48        62.94       107.50      1.71
V  Verbal                   78.13       30.00       .38        19.86        54.50      2.74
N  Numerical                67.57       42.25       .62        58.82        82.00       .54
S  Spatial                 118.65       24.75       .21        14.18         5.50       .39
P  Form Perception         158.28      802.75      5.07*      119.80        85.50       .04
Q  Clerical Perception     143.71      520.50      3.62*       93.12        71.00       .76
K  Motor Coordination       95.68       17.75       .19        21.60        30.50      1.41
F  Finger Dexterity        239.72      330.75      1.38      2291.32        84.00       .38
M  Manual Dexterity        208.30      851.25      4.09*      225.37        16.50       .07

Note. Analysis in Long-Range Study was based on raw scores for aptitudes (V, S, Q,
* P < .01.


after three years), suggesting that there may be deterioration in 
stability of measurement over the three-year period. 

To obtain information on this point, the F test for homogeneity 
of regression was applied separately to the data for the Short-Range 
and Long-Range Studies with the results shown in Table 4. 
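
As a check on how the entries of Table 4 relate to one another, the sketch below recomputes F for the three aptitudes that reach significance in the Short-Range Study. It assumes that each tabled F is the interval mean square divided by the error mean square. The text does not state this explicitly, but the tabled values are consistent with it.

```python
def f_ratio(ms_interval, ms_error):
    """F statistic taken as interval mean square over error mean square."""
    return ms_interval / ms_error

# (MS interval, MS error) pairs copied from the Short-Range columns of Table 4.
short_range = {
    "P": (802.75, 158.28),
    "Q": (520.50, 143.71),
    "M": (851.25, 208.30),
}

for aptitude, (ms_interval, ms_error) in short_range.items():
    print(aptitude, round(f_ratio(ms_interval, ms_error), 2))
```

The rounded ratios reproduce the tabled values of 5.07, 3.62, and 4.09, with 4 and 1695 degrees of freedom.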

Significant (.05 level) differences were found for Aptitudes P, 
Q, and M, in the Short-Range Study, but no significant differ- 
ences for any of the nine aptitudes in the Long-Range Study. 

Although there are significant increases in mean scores upon re-
testing after as long as three years for all aptitudes, there is little
deterioration in stability of measurement over this period after
initial testing. The implication of this finding is that retesting an
individual with aptitude tests before the end of three years will
generally be unnecessary, assuming that his initial test scores are
valid. If the individual is exposed to training or experience that
would be likely to affect his aptitudes, however, it is desirable to
retest.


In view of the tremendous advances that have been made in the
adaptation of electronic computers and accounting machines to the
processing of statistical data, sections of the Spring and Autumn is-
sues of EDUCATIONAL AND PSYCHOLOGICAL MEASURE-
MENT are devoted to the publication of such programs as are ap-
propriate to psychometric procedures. Programs relevant to such
problem areas as factor analysis, item analysis, multiple regres-
sion procedures, the estimation of the reliability and validity of
tests, pattern and profile analysis, the analysis of variance and 
covariance, discriminant analysis, and test scoring will be con- 
sidered. Customarily a program should be expected not to exceed 
six or eight printed pages. Manuscripts of four or fewer printed 
pages are preferred. Each manuscript will be carefully reviewed as 
to its suitability and accuracy of content. In some instances an 
accepted paper may be returned to the author for possible re- 
visions or shortening. The cost to the author will be twenty-five 
dollars per page for regular running text. The extra cost of the 
composition of tables and formulas will be added to the basic rate. 

Manuscripts received up to November first will be considered for 
the Spring issue; manuscripts received between then and May first 
will be considered for the Autumn issue. 


All correspondence and duplicate manuscripts should be directed 
to: 


Dr. William B. Michael 
325 Callita Place 
San Marino, California 91108. 


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
1969, 29, 689-692. 


A COMPUTER PROGRAM FOR
ESTIMATING TRUE-SCORE DISTRIBUTIONS AND
GRADUATING OBSERVED-SCORE DISTRIBUTIONS¹


MARILYN S. WINGERSKY, DIANA M. LEES,
VIRGINIA LENNON, AND FREDERIC M. LORD
Educational Testing Service


In practical work, the “Method 20” computer program described
here will, for the present, probably be used mainly for graduating
(i.e., fitting or “smoothing”) actual test score distributions. The
necessary input is simply the actual distribution of test scores.

Researchers will be interested in seeing the estimated true-score
distributions and, also, in seeing mean true score for each observed
score, and the variance about this mean. At a later date, the pro-
gram will be revised to produce, also, a measure of the discriminat-
ing power of the test at each score level.

Further computations based on true-score distributions found by
this program may enable the researcher (Lord, 1965):

1. To estimate the frequency distribution of observed scores that
will result when a given test is lengthened.

2. To equate true scores on two tests by the equipercentile method.

3. To estimate the frequencies in the scatterplot between two
parallel (nonparallel) tests of the same psychological trait,
using only the information in a (the) marginal distribution(s).
(A computer program for doing this is described in Stocking,
Lees, Lennon, and Lord, 1969.)

4. To estimate the frequency distribution of a test for a group
that has taken only a short form of the test (this is useful for
obtaining norms).

5. To estimate the effects of selecting individuals on a fallible
measure.

6. To effect matching of groups with respect to true score when
only a fallible measure is available.

7. To investigate whether two tests really measure the same
psychological function when they have a nonlinear relation-
ship.

8. To describe and evaluate the properties of a specific test con-
sidered as a measuring instrument.

9. To estimate the item-true score regression for particular
items, without strong prior assumption as to its mathemati-
cal form (Lord, in press (a)).

¹ This work was supported in part by contract Nonr-2752(00) between the
Office of Naval Research and Educational Testing Service. Reproduction,
translation, use and disposal in whole or in part by or for the United States
Government is permitted.



The true-score distribution is computed using the model de- 
scribed in Lord (in press (b)). This model assumes (a) the con- 
ditional distribution of observed scores for a fixed true score is a 
(certain approximation to a) compound binomial distribution 
and (b) the true-score distribution for the group tested is smooth. 


The model defines the estimated proportion of examinees with
observed score x as

$$\hat{\phi}(x) = \int_a^b h(x \mid \zeta)\, \hat{g}(\zeta)\, d\zeta \quad \text{for } x = 0, 1, \ldots, n, \tag{1}$$

where n is the number of items, h(x|ζ) is an approximation to a
compound binomial, and ĝ(ζ) is the estimated true-score distri-
bution. This true-score distribution is estimated using observed-
score frequencies that have been grouped to decrease sampling
fluctuations. The formula for the estimated true-score distribution
is


$$\hat{g}(\zeta) = \sum_{u=1}^{U} \lambda_u \sum_{x:u} h(x \mid \zeta), \tag{2}$$

where $\sum_{x:u}$ denotes summation over the observed scores x contained
in the uth class interval, adjacent observed scores having been
grouped into U class intervals. Substituting this formula for ĝ(ζ)
into equation (1) gives the estimated observed-score relative fre-
quencies as functions of the estimated λ's. These λ's are estimated
from the actual observed-score relative frequencies f(x) using a
maximum likelihood technique. The maximum likelihood estimation
of the λ's is restricted by the requirement that all of the λ's be
non-negative. This requirement is necessary to prevent the occur-
rence of “negative frequencies” in the estimated true-score distri-
bution. The formulas and the procedure for computing the lambdas
are developed in detail in Lord (in press (b)).
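
The structure of equation (1) can be illustrated numerically. The sketch below is not the Method 20 program. It substitutes an ordinary binomial for the compound-binomial approximation to the conditional distribution, takes the limits of integration as 0 and 1, and uses a hypothetical uniform true-score density, so that the integral has the known value 1/(n + 1) for every observed score. The function names and the midpoint integration rule are assumptions made for illustration.

```python
import math

def binom_pmf(x, n, zeta):
    """Ordinary binomial probability of x right answers out of n items.
    Stand-in for the compound-binomial approximation used by the model."""
    return math.comb(n, x) * zeta ** x * (1.0 - zeta) ** (n - x)

def observed_distribution(n, g, steps=2000):
    """Midpoint-rule approximation to the integral over [0, 1] of the
    conditional distribution times the true-score density g, for each
    observed score x = 0, 1, ..., n."""
    width = 1.0 / steps
    phi = [0.0] * (n + 1)
    for k in range(steps):
        zeta = (k + 0.5) * width
        weight = g(zeta) * width
        for x in range(n + 1):
            phi[x] += binom_pmf(x, n, zeta) * weight
    return phi

def uniform_density(zeta):
    """Hypothetical true-score density: uniform on [0, 1]."""
    return 1.0

phi = observed_distribution(10, uniform_density)
print([round(p, 4) for p in phi])
```

With a non-uniform density the same routine yields a smoothed observed-score distribution whose shape depends on the density supplied, which is the relationship the program exploits in the reverse direction when it estimates the true-score distribution from observed frequencies.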

A chi-square is computed between the fitted observed-score dis- 
tribution and the actual observed-score distribution to test the 
goodness of fit. 
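
A goodness-of-fit statistic of this kind might be computed as in the following sketch. It assumes the familiar Pearson form, the sum of squared differences between actual and fitted frequencies divided by the fitted frequencies. The source does not say which form the program uses or how degrees of freedom are counted, and the frequencies shown are hypothetical.

```python
def chi_square(actual, fitted):
    """Pearson chi-square between actual and fitted frequencies.
    Cells with a fitted frequency of zero are skipped in this sketch."""
    return sum((a - f) ** 2 / f for a, f in zip(actual, fitted) if f > 0)

# Hypothetical frequencies for three score classes (illustration only).
actual = [18, 55, 27]
fitted = [20, 50, 30]
print(round(chi_square(actual, fitted), 3))
```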

Restrictions. The program is written in FORTRAN IV for the 
IBM 7044 using the IBSYS monitor modified to allow unlimited 
underflows. The program is being converted to FORTRAN IV (G) 
for the IBM 360. This program sets the limits a and b of the integral
in equation (1) equal to 0 and 1, respectively, as this is sufficient for most
applications of the program. (A program that allows the user to 
specify a, b, and also the parameters of a smoothing function exists, 
but has not been documented for general distribution.) 

The maximum number of items per test is 100. The maximum 
number of class intervals into which the observed scores are grouped 
is 26. This program may produce unsatisfactory results if used on 
frequency distributions of less than a thousand cases. 

Input. The quantity of input depends on the degree of control 
over the estimating procedure wanted by the user. The minimum 
input necessary is: 


1. Printout identification, e.g. title. 

2. The number of test items. 

3. The number of examinees. 

4. The frequency distribution of observed scores. 


The user can have more control over the final smoothed distribu-
tion by supplying the grouping of the observed-score frequencies.
He can also supply the variance of the difficulties (proportions of
correct answers) of the items in the test. This last is helpful but not
essential for approximating the compound binomial in equation (1).

Output. 1. The first section of output contains statistics describing
the observed-score and the true-score distributions. These quanti-
ties include the mean, first four central moments, and measures of
skewness and kurtosis for both observed scores and true scores;
also the Kuder-Richardson reliability coefficients KR20 and KR21
for observed scores.

2. The smoothed observed-score distribution is printed; also the
chi-square between the smoothed and the actual observed-score 
distributions, together with the probability level of the chi-square. 

3. The mean true score for each observed score is printed; also 
the standard deviation of true scores about this mean. This gives 
the regression of true score on observed score; also a measure of
the scatter about this regression. 

4. The output lists relative frequencies and also shows graphic 
plots of the observed-score distribution and of the estimated true- 
score distribution. 
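
The Kuder-Richardson coefficients listed in the first section of output can be sketched from their standard textbook definitions. The program's own code is not shown in the source, so the function names and the numbers below are illustrative assumptions. Formula 21 needs only the number of items, the mean, and the variance of observed scores; formula 20 also needs the item difficulties.

```python
def kr21(n_items, mean, variance):
    """Kuder-Richardson formula 21 from test length, mean, and variance."""
    return (n_items / (n_items - 1)) * (
        1.0 - mean * (n_items - mean) / (n_items * variance)
    )

def kr20(difficulties, variance):
    """Kuder-Richardson formula 20 from item difficulties (proportions
    of correct answers) and the observed-score variance."""
    n_items = len(difficulties)
    sum_pq = sum(p * (1.0 - p) for p in difficulties)
    return (n_items / (n_items - 1)) * (1.0 - sum_pq / variance)

# Hypothetical 50-item test with mean 30 and variance 64.
print(round(kr21(50, 30.0, 64.0), 4))
# With all items equally difficult (p = .6) the two formulas coincide.
print(round(kr20([0.6] * 50, 64.0), 4))
```

When item difficulties vary, formula 20 gives a value at least as large as formula 21, which is one reason the variance of the item difficulties is useful optional input.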

Timing. The program takes from one to ten minutes, the amount 
of time depending on the actual distribution being fitted. 

This program may be obtained from the authors, Educational 
Testing Service, Princeton, N. J. 08540.


TEST PROCESSING AND REPORTING PROGRAMS
FOR THE IBM S360/20


RICHARD T. JOHNSON
American Institutes for Research


In most testing programs, a great many processing steps intervene
between test scoring and the reporting of results. Although optical
readers, such as the IBM 1230, can easily be programmed to handle
the vast majority of simple scoring tasks, it is more difficult to set
up the additional intermediate steps necessary to ensure correct,
easily interpretable results.

Steps in computer processing of examination scores. In general,
one or more score cards containing raw scores and candidate iden-
tification numbers are produced as a first step. In addition, there
are master cards with the same identification number, and probably
the candidate’s name. In such situations, common elements in the
computer processing of most examination results include: (a) edit-
ing the scores on each paper to ensure that all scores are present
or a candidate absence is noted, scores are in the proper order, and
each candidate is accounted for; (b) ascertaining that each score
[...]; (c) [...] crossfooting for each [...]
as determined by the demands of the situation; (d) combining
[...] from the various [...]; (e) [...] summary statistics for use in trans-
forming the scores to some arbitrary distribution; (f) punching
new adjusted scores in the candidate’s master name card; (g)
providing frequency distributions of the adjusted final scores; and
(h) listing the candidates in order of merit, alphabetically, or in
some other order.

Since each of the eight steps listed above is composed of several
discrete separate steps, the process involves literally dozens of card
handlings with the consequent likelihood of errors. In addition, a 
separate program is generally written for each step in each ex- 
amination to be processed, and the extent of data editing is typi- 
cally minimal. 

Two computer programs. A systems analysis indicated that 
steps one through five could be included in the specifications for 
General Purpose Program I (GPP-I), and steps six and seven
included in General Purpose Program II (GPP-II). Step eight, 
the listing of candidates, can be done with one of the IBM-supplied 
Utility Programs, and presents no particular technical difficulties. 
Descriptions of the two programs follow. 

For both programs, it is assumed that no scores are negative,
and that all scores are in the range of 0 to 9999. Card and column
designations are uniform throughout; the form is always NAAB
where

N is the card number within case; 1 ≤ N ≤ 9
AA is the leftmost column number of the field; 01 ≤ AA ≤ 80
B is the width of the field in columns; 1 ≤ B ≤ 4


As an example, suppose that each candidate had three cards, and 
his arithmetic test score were punched in columns 45-47 of the 
second card. The score would be described as 2453, or card 2, start- 
ing in column 45, and 3 columns wide. 
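
The NAAB convention can be illustrated with a short sketch. This is not the author's assembler code; the function names `parse_naab` and `read_field` are hypothetical, and the bounds on N, AA, and B are read as inclusive, which is an assumption.

```python
from typing import List, Tuple


def parse_naab(code: str) -> Tuple[int, int, int]:
    """Split a four-digit NAAB designation into (card, first_column, width).

    Hypothetical helper: bounds are assumed inclusive (1-9, 01-80, 1-4).
    """
    if len(code) != 4 or not code.isdigit():
        raise ValueError("NAAB designation must be exactly four digits")
    card = int(code[0])        # N: card number within the case
    column = int(code[1:3])    # AA: leftmost column of the field
    width = int(code[3])       # B: field width in columns
    if not (1 <= card <= 9 and 1 <= column <= 80 and 1 <= width <= 4):
        raise ValueError("NAAB designation out of range: " + code)
    if column + width - 1 > 80:
        raise ValueError("field runs past column 80: " + code)
    return card, column, width


def read_field(cards: List[str], code: str) -> str:
    """Return the text punched in the designated field of one candidate's cards."""
    card, column, width = parse_naab(code)
    image = cards[card - 1].ljust(80)   # pad short card images to 80 columns
    return image[column - 1:column - 1 + width]


# The example from the text: card 2, starting in column 45, 3 columns wide.
cards = [" " * 80, " " * 44 + "087" + " " * 33, " " * 80]
score = int(read_field(cards, "2453"))
```

Columns are 1-based on the card but 0-based in the slice, hence the `column - 1` offset.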

For GPP-I, the cards are sorted so that all cards for a candidate 
are together, with the master card following the test score cards. 
No special order is assumed for GPP-II. 

General Purpose Program I. GPP-I includes major steps one 
through five, with information concerning the data supplied to the 
program through three types of parameter cards. All editing checks 
(step one above) are made using the logical mode which ensures 
absolute equivalency, and all limiting checks (step two above) are 
made using arithmetic mode which ensures identification of blank 
scores. 

The first type of parameter card which the user must prepare
specifies up to twenty separate checks to be carried out. If any
single one is not satisfied, a message and error card image are
printed, all card images are moved up one position in memory, and
the check sequence is tried again. This is necessary since one or
more cards may be missing, and the program must continue searching
until it finds the first card of the next candidate; it may result
in multiple error messages.

Checks can be made to determine whether (a) a certain code is 
punched in a field (e.g., to designate a particular test or card), (b) 
a field is blank, and (c) all cards presumed to be from the same 
candidate all have the same identification number. If all conditions 
are satisfied, the Program will continue with the next step as out- 
lined below and will place the correct master card in whatever 
hopper is designated. 

The second type of parameter card specifies up to a maximum 
of twenty fields to be checked, and the maximum and minimum 
permissible scores. Any scores outside the range, including blanks, 
cause an error message and the card image to be printed; the case 
is then flushed. 

The third type of parameter card supplies information concerning
the calculation phase as outlined in steps three through five
above. The basic NAAB designations referred to earlier are strung
together with operation codes, including +, -, *, /, =, and actual
integer values in a manner somewhat analogous to Fortran arithmetic
statements. Thus one can crossfoot, weight tests unequally,
obtain ratios of one score to another, calculate averages of a
number of subscales, find difference scores, and carry out many other
activities.

A total of forty operations is possible, and any field designated
by an equal sign in an “equation” is used for output statistics as
well as punching. The statistics produced are means, standard
deviations, variances, covariances, intercorrelations, and the number
of cases in which the data have passed all previous editing
checks (cf. Intercorrelation Matrix, IBM 360D-13.6.002).

All punching is done on the last card processed for each candidate,
except for dummy punching where only statistics are desired.
A scan is made of the output image before punching is begun in
order to ensure optimum speed on the MFCM. Calculations are
done in strict sequence, and any intermediate results are carried
to fifteen digits although only a maximum of four can be punched.

General Purpose Program II. GPP-II includes major steps six
and seven, with only one parameter card needed. The program was
designed in conjunction with GPP-I, and uses the output from it,
although it can also be used separately. All cards punched by the
previous program have been checked; nevertheless a few additional
checks are made, and these cannot be overridden by the user.

It is assumed that only a single summary card exists for each
candidate, and that transformed results are to be punched into it
or a dummy card if only statistics are wanted. It is also assumed
that the new means and standard deviations are positive only.
All digits to the right of the decimal place are truncated in the
final punched output.

The parameter card supplies information concerning the old and
new means and standard deviations, and where they are to be
punched. Any number resulting from the transformation, punched
or dummy, which has a width of two columns will be included in
the output of frequency distributions (cf. Double Digit Count,
IBM 360D-09.0.002). Statistics on single digit output fields such
as stanines can be obtained by also placing the result in a dummy
two-digit field.

A maximum of twenty transformations is allowed per job, and 
all are of the form Y = a + bX where Y is the new score, and X 
the old one. The a and b are constants derived within the program 
from the old and new statistics using conventional textbook formulas.
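
As a hedged illustration of the transformation Y = a + bX, the usual textbook choice is b = (new standard deviation)/(old standard deviation) and a = (new mean) - b(old mean). The sketch below assumes that choice; the text does not spell out the formulas, and the function names are hypothetical. Truncation of the decimal part follows the description of the punched output.

```python
import math
from typing import Tuple


def linear_constants(old_mean: float, old_sd: float,
                     new_mean: float, new_sd: float) -> Tuple[float, float]:
    """Return (a, b) so that Y = a + b*X has the new mean and standard deviation.

    Assumed textbook formulas: b = new_sd / old_sd, a = new_mean - b * old_mean.
    """
    if old_sd <= 0 or new_sd <= 0:
        raise ValueError("standard deviations must be positive")
    b = new_sd / old_sd
    a = new_mean - b * old_mean
    return a, b


def transform(x: float, a: float, b: float) -> int:
    """Apply Y = a + b*X and truncate everything right of the decimal point."""
    return math.trunc(a + b * x)


# Rescale scores with mean 40, SD 8 to a scale with mean 50, SD 10.
a, b = linear_constants(40, 8, 50, 10)
new_score = transform(45, a, b)   # 56.25 before truncation
```

Truncating rather than rounding biases punched scores slightly downward, which may matter when cut-off scores are applied to the transformed values.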

Summary. Two general purpose programs were written in basic
assembler language for the processing steps between test scoring
and the final reporting of results, applicable to most testing
situations. A minimum of parameter cards supplies information for
completely flexible input, processing, and output.

The two programs have been in use for approximately six months
at the West African Council, Lagos, Nigeria, and
were developed for the American Institutes for Research which has
a project there.

Copies of the source decks and an expanded writeup are available
by writing to Dr. Richard T. Johnson, American Institutes for
Research, 135 North Bellefield, Pittsburgh, Pennsylvania 15213.
PACKAGE: A SYSTEM OF COMPUTER ROUTINES FOR
THE ANALYSIS OF CORRELATIONAL DATA


JOHN E. HUNTER
Michigan State University

AND

STANLEY H. COHEN
Computer Institute for Social Science Research



In this period of large capacity, high-speed computers, there
can be found at almost every computer installation a program
library containing an assortment of routines to analyze
correlational data. However, few offer a program system composed of
several linked routines. And usually either the system is not
transportable because some of its elements are written in machine or
assembly code or the system consists of a complex group of
[...] that prevent the novice from fully utilizing it. Moreover,
modifying programs within the system or the program [...]
requires a large investment of time, money, and energy.

The present writers believed a system of correlational
programs could be developed that was (1) transportable from
installation to installation, (2) made up of a relatively simple set of
control card options, (3) easily modifiable or expandable, and
(4) flexible in its interconnected routines. The result was PACKAGE.

All of the subroutines in PACKAGE were written in
Extended FORTRAN. The main program reads
control cards as well as directs the execution [...].
At the completion of each analysis, the output (usually a [...])
is available for a subsequent routine.

[...] are preserved throughout a run, the user is allowed to identify
and operate on particular subsets of variables [...]
correlation matrix has been reordered or [...]
TABLE 1
Descriptions of the Major Routines in PACKAGE

Routine name   Features

CORR           Computes means, standard deviations, and
               product-moment correlations. A user-supplied
               transformation section permits dropping of
               observations, creation of variables, as well as
               other variations from standard input.

READR          Furnishes utility routine for the input of
               correlation matrices in BCD or binary mode from
               punched cards or magnetic tape.

COPYR          Provides utility routine for the output of the
               correlation matrix in core memory onto a
               user-specified external storage device.

PUNCH          Affords utility routine for the output of the
               correlation matrix in core memory in punched
               card form.

REORDER        Forms subset of the variables in the correlation
               matrix in memory and arranges the variables
               in the order specified by the user.

ORDER          Orders the correlation matrix in core memory
               by linking each variable with the one with
               which it most highly correlates.

REFLECT        Reverses the signs of designated variables.

MGRP           Performs an oblique multiple-groups analysis.
               Groups are specified from those comprising
               the correlation matrix in core memory. The
               group variables along with the original
               variables are saved in core memory.

SQRTA          Performs a square-root analysis on the
               correlation matrix in core memory.

MAXDEC         Represents variation on square-root analysis.
               A stepwise procedure incorporates variables
               that maximally account for the residual
               variance left after variables are partialed out.

DUMMY          Permits the user to incorporate his own program
               within the system. DUMMY allows the
               user to test experimental routines that might
               later become standard system programs or to
               utilize other programs whose permanent
               inclusion in the system is not warranted because
               of infrequent demand.

Note. With the exception of MAXDEC, all routines in the present
version of PACKAGE can [...] variables on a CDC 3600 computer
with a 32K core memory. This bound, of course, can be easily
modified for other installations and configurations.

Certain programs have been designated as utility routines. These
routines copy, read in, or generate the current correlation matrix in
core memory and thus enable the user to analyze several matrices,
in whole or part, within the same run.


TABLE 2
Control Cards for a Sample Job on PACKAGE

Control card                   Explanation

CORR, NV = 50, NS = 676,       User has requested routine CORR. Input data
LU = 60, TRANS*                on 50 variables from 676 subjects are to be
                               read from the card reader (unit 60). There is
                               one format card and a transformation
                               subroutine is present.

VARS(1-80)                     Correlations, means, and standard deviations
                               are to be calculated for variables 1 through
                               80. (Thirty variables were created in the
                               transformation subroutine.)

(60F 1.0)                      Format card.

data deck is inserted here

PUNCH*                         The correlation matrix is to be output in card
                               image form.

COPYR, LU = 20*                The correlation matrix is written on unit 20.

REORDER*                       A correlation matrix is to be formed and
VARS(1-10, 15, 19, 76-80)      arranged from the list given on the next
                               card(s). (This matrix is now in core memory
                               and the previous matrix is destroyed.)

SQRTA*                         A square-root analysis is to be performed on
                               the matrix in core memory.

READR, LU = 20, NV = 80*       An 80 X 80 variable matrix is to be read
                               into core memory from unit 20. (This
                               instruction returns the original correlation
                               matrix to core memory.)

MGRP, NG = 4*                  User has requested an oblique multiple groups
GRPC(1-5)                      analysis. There are four groups and these are
GRPC(6-10)                     specified by the control cards that follow.
GRPC(11-50)
GRPC(51-80)

REORDER*                       (see above) Variables 501-504 are the four
VARS(501, 502, 503, 504,       group variables or factors created in the
1-80)                          previous analysis.

SQRTA*                         (see above)

END*                           The job is terminated.

Table 1 summarizes the features of the major routines in PACKAGE.
Control cards for a sample job are illustrated in Table 2.
A source deck, listing, and technical report for PACKAGE can
be obtained from the senior author upon request.

PROGRAM TO COMPUTE PREDICTION INDEX
SCORES FOR USE IN ASSESSING THE
PREDICTIVE EFFICIENCY OF TESTS


ELAINE L. WALKER
Sydney Technical College
N.S.W., Australia


Psychologists often need to make predictions for individuals
about the likely outcome of a course of action; e.g., whether they
will pass or fail in a particular course, if they enrol. These
predictions are usually made by classifying individuals into two or
more categories; e.g., those for whom one would predict pass or
failure as the more likely outcome. Frequently, psychological
tests are used as a basis for classification and a particular test is
chosen because of its predictive validity in similar situations.
However, as Meehl and Rosen (1955) have shown, the predictive
efficiency of a cutting score on a test, for correctly classifying
individuals, depends jointly on the test's validity and on the
distribution of the criterion variable in the population in question
(base rate).

Meehl and Rosen have developed formulae for computing
prediction index scores based on Bayes' Theorem, which show the
efficiency of a cutting score on a test for classifying individuals in
a population with a known base rate. Various cut-off points may
be chosen to maximize the correct classification of:

(a) all hits (true positives and true negatives).

(b) hits for true positives.

(c) hits for true negatives.

A program in FORTRAN IV for an IBM 1401 with 16K storage
has been developed to calculate prediction index scores using Meehl
and Rosen's formula. This program has been developed for use in
an educational institution for prediction of pass and failure in
examinations, but it would be possible to use the program in any
setting, by substituting positive for pass and negative for failure.


INPUT

1st Data Card

Columns 1-10   Total number of passes in the sample. Format I10.
Columns 11-20  Total number of failures in the sample. Format I10.
Columns 21-30  Grand total in sample. Format I10.
Columns 31-40  K, the number of class intervals of predictor
               test scores. This must be less than 10. Format I10.

2nd Data Card

The frequency count for each class interval of the predictor test
for students passing in the criterion, beginning with the frequency
in the class interval with the highest predictor test scores. Format
10I8.

3rd Data Card

The frequency count for each class interval of the predictor test for
students failing in the criterion, beginning with the frequency in
the class interval with the highest predictor test scores. Format
10I8.

4th Data Card

The score forming the lower limit of each class interval, beginning
with the lower limit of the class interval with the highest predictor
test score. Format 10A4.

5th Data Card

The score forming the upper limit of each class interval, beginning
with the upper limit of the class interval with the highest predictor
test score. Format 10A4.

6th and 7th Data Cards

Heading, may be blank.

8th Data Card

Columns 3 and 4: base rate, expressed as a proportion, such that
proportion passing and proportion failing = 1.00.
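
The kind of calculation these cards feed can be sketched as follows. This is not a transcription of the FORTRAN program, whose listing is not reproduced here; it is a minimal Bayes' Theorem sketch in the spirit of Meehl and Rosen, and the function name, the dictionary keys, and the three indices reported are assumptions. Frequencies are ordered from the highest class interval downward, as on the 2nd and 3rd data cards, and a candidate scoring at or above the cut is predicted to pass.

```python
from typing import Dict, List, Optional


def prediction_indices(pass_counts: List[int], fail_counts: List[int],
                       base_rate: float) -> List[Dict[str, Optional[float]]]:
    """For each possible cut (at the lower limit of interval k), return
    the overall hit rate and the hit rates among predicted passes and failures."""
    total_pass, total_fail = sum(pass_counts), sum(fail_counts)
    p, q = base_rate, 1.0 - base_rate      # population pass and fail proportions
    rows = []
    cum_pass = cum_fail = 0
    for k in range(len(pass_counts)):
        cum_pass += pass_counts[k]
        cum_fail += fail_counts[k]
        p1 = cum_pass / total_pass         # passes at or above the cut
        p2 = cum_fail / total_fail         # failures at or above the cut
        tp, fp = p * p1, q * p2            # true and false positives
        tn, fn = q * (1 - p2), p * (1 - p1)
        rows.append({
            "all_hits": tp + tn,
            "positive_hits": tp / (tp + fp) if tp + fp > 0 else None,
            "negative_hits": tn / (tn + fn) if tn + fn > 0 else None,
        })
    return rows


# Three class intervals, highest first; base rate of passing is 0.60.
rows = prediction_indices([30, 20, 10], [5, 15, 20], base_rate=0.60)
best_cut = max(range(len(rows)), key=lambda k: rows[k]["all_hits"])
```

Because the base rate enters every term, the cut that maximizes all hits in one population need not do so in another with a different proportion passing.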


A listing of the program is available, on request, from Miss E. L.
Walker, Guidance Office, Sydney Technical College, Broadway,
Sydney, N.S.W.


REFERENCE
Meehl, P. E. and Rosen, A. Antecedent Probability and the
Efficiency of Psychometric Signs, Patterns, or Cutting Scores.
Psychological Bulletin, 1955, 52, 194-215.
AN AUTOMATED DEMONSTRATION OF
VARIOUS SAMPLING DISTRIBUTIONS¹


PAUL A. GAMES 
The Pennsylvania State University 


Probably most statistics instructors would agree that mastery of
sampling distributions is crucial to an understanding of statistical
inference. However, this concept is often difficult for many
beginning students to grasp. As an aid to understanding, some books
illustrate all possible samples for a small population (Dixon and
Massey, 1957; Games and Klare, 1967), while others suggest that
students engage in small sampling experiments to illustrate various
sampling distributions (Blommers and Lindquist, 1960; Li, 1964).
The former demonstrations suffer from the artificial nature of the
populations used, while the latter are excessively time consuming
and highly likely to have computational errors. The program
described provides an effective sampling demonstration that can be
used to illustrate, e.g., the central limit theorem or the effect of
increased sample size on the standard error of the sample mean
and/or variance.

Program characteristics. The program and supplied data deck
enable the user to specify: (a) which of five populations he will draw
from; (b) the sample size he will use; (c) the number of samples
desired; and (d) the length of output he wants: short or long. The
program uses a pseudo-random number generator to select random
samples from a large population of discrete measures. The five
populations are described in Games and Lucas (1966) and Lucas
(1964). They consist of a discrete approximation to the normal
curve, a symmetric highly leptokurtic population, and three
populations of varying degrees of skewness. The short output consists of
the descriptive indices and histograms of: (a) the population from
which one is sampling; (b) two samples from the above population;
(c) the sampling distribution of the mean ($\bar{X}$) with the same base
scale as the population; (d) a repeat of the above with expanded
baseline to show form. The long option includes the above plus:
(e) the first 2100 raw score ($X$) values to contrast the frequency
distribution of $X$ with that for $\bar{X}$; (f) the sampling distribution of
the sample variance, $s^2 = \sum (X - \bar{X})^2/(n - 1)$; (g) a linear
transformation of the above to approximate a Chi-Square distribution
(when sampling from the normal distribution only; other
populations will yield discrepant forms); (h) the sampling distribution of
the sample range; (i) the sampling distribution of the ratio of odd vs.
even sample variances (an F distribution when sampling from the
normal population); and (j) the sampling distribution of the sample
standard deviation (this being used to illustrate that
$M_s \neq \sqrt{M_{s^2}}$).

¹ This program was supported by debugging time supplied by the
Computation Centers of Ohio University and The Pennsylvania State
University.

After either output series, the sample size may be changed, or the 
number of samples increased. Only the population used must stay 
constant on a given run. 
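
The core of such a demonstration can be sketched in a few lines. This is not the FORTRAN program described; the function name, the seeded generator, and the small uniform population are assumptions made only for illustration.

```python
import random
import statistics
from typing import List, Sequence


def sampling_distribution_of_mean(population: Sequence[float], n: int,
                                  num_samples: int, seed: int = 0) -> List[float]:
    """Draw num_samples random samples of size n (with replacement)
    and return the list of sample means."""
    rng = random.Random(seed)   # seeded so that a run can be repeated
    return [statistics.fmean(rng.choices(population, k=n))
            for _ in range(num_samples)]


# A small discrete population: the integers 1 through 10.
population = list(range(1, 11))
sigma = statistics.pstdev(population)

means_n4 = sampling_distribution_of_mean(population, n=4, num_samples=2000)
means_n16 = sampling_distribution_of_mean(population, n=16, num_samples=2000)

# The empirical standard errors should be close to sigma / sqrt(n).
se_n4 = statistics.pstdev(means_n4)
se_n16 = statistics.pstdev(means_n16)
```

Quadrupling the sample size from 4 to 16 should roughly halve the standard error of the mean, which is the effect the demonstration is meant to make visible.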

Program language. The programs are written in FORTRAN
IV, E Level. They consist of a main program and several subroutines.
The two major subroutines used, TABIM and PHIST, are the
author's revision of two subroutines from the IBM Scientific
Subroutine Package (1967). They may profitably be employed in other
computer problems requiring the tabulation of a frequency
distribution from a vector of measures, or the printout of a histogram.
The only subroutine that is machine-specific is the random number
generator. This is designed to function on any IBM 360 computer,
but should be replaced on other machines. Printouts and write-ups
may be secured by writing to the author.
FORTRAN PROGRAMS PROVIDING WEIGHTS
IN SURVEY DESIGNS USING STRATIFIED SAMPLES


JOHN A. CREAGER 
American Council on Education 


When performing sample surveys in educational research,
various kinds of weights may be required to make appropriate estimates
of population parameters from the data obtained in the survey
sample. A discussion of the issues involved in a practical research
setting is presented elsewhere (Creager, 1968). The programs
described in this note were developed in the Cooperative
Institutional Research Program of the American Council on Education
(Astin, Panos, and Creager, 1966).

Program GENWTS is designed to supply a complete set of
weights for data obtained from within-institution sampling units
(e.g., students, faculty, or administrators) and for institutional
summary data. Program INSTWTS is designed to supply weights
for data obtained from institutions as the data unit. Both
programs are written in FORTRAN IV for the CDC 3600, but may
be easily adapted to run on other equipment.

Both programs assume a defined population of institutions, which
population has been sampled in accordance with some specified
stratification design. It is not necessary, and will seldom be true,
that the data have been obtained from random samples
of each stratification cell. For various technical reasons it may be
desirable to oversample in some cells, and, in any case, not all
sampled institutions may be able to participate in the survey.

Definition of weights. Four kinds of weights are generated in
Program GENWTS. They are defined as follows:

Type I. Institutional cell weights are computed for each cell as
the ratio of the sum of within-institution data units across the
population institutions in that cell to the sum of the
within-institution data units across the sample institutions in that cell. This
type of weight is constant for all sampled institutions in a given
stratification cell.

Type II. Within-institution sampling weights are computed for 
each institution in the sample as the ratio of the total number of 
sampling units in that institution to the number of sampling units 
from which data have been obtained. This type of weight corrects 
for random deviation from 100 percent participation of data units 
within an institution. 

Type III. The third type of weight is the product of the first 
two kinds of weights. It would normally be applied to subsequent 
processing of data records developed from the within-institution 
sampling units. 

Type IV. Institutional weights, appropriate for subsequent pro- 
cessing of institutional unit or summary records, are computed for 
each cell as the ratio of the number of population institutions to 
the number of sample institutions in that cell. 

Program INSTWTS generates only type IV weights. 
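
The four definitions can be made concrete with a small sketch. It is not derived from the GENWTS or INSTWTS source decks; the data layout, the function name `compute_weights`, and the reading of "within-institution data units" as each institution's total number of sampling units are assumptions.

```python
from collections import defaultdict
from typing import Dict, Tuple


def compute_weights(population: Dict[str, Tuple[int, int]],
                    sample: Dict[str, int]) -> Dict[str, Dict[str, float]]:
    """population maps institution id -> (cell, total sampling units);
    sample maps institution id -> units from which data were obtained."""
    pop_units, pop_count = defaultdict(int), defaultdict(int)
    samp_units, samp_count = defaultdict(int), defaultdict(int)
    for cell, units in population.values():
        pop_units[cell] += units
        pop_count[cell] += 1
    for inst in sample:
        cell, units = population[inst]
        samp_units[cell] += units
        samp_count[cell] += 1

    weights = {}
    for inst, responded in sample.items():
        cell, units = population[inst]
        type1 = pop_units[cell] / samp_units[cell]   # cell weight
        type2 = units / responded                    # within-institution weight
        weights[inst] = {
            "type1": type1,
            "type2": type2,
            "type3": type1 * type2,                  # product of the first two
            "type4": pop_count[cell] / samp_count[cell],
        }
    return weights


population = {"A": (1, 1000), "B": (1, 3000), "C": (1, 2000),
              "D": (2, 500), "E": (2, 1500)}
sample = {"A": 250, "C": 1000, "D": 500}
weights = compute_weights(population, sample)
```

In this example institution A carries a Type III weight of 8: each respondent stands for 4 students at A, and A's cell is represented at half its population size.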

Features common to both programs. Both programs can be used 
with any stratification design not exceeding 35 cells, provided all 
institutions in the defined population can be uniquely assigned to 
a cell, and the sample institutions constitute a subset of the cells 
stratifying the total population. The program assumes the same 
institutional I.D. number system for the population and the sam- 
ple. If a given application results in finding some cells inade- 
quately represented, resulting in too large cell weights, the user 
may combine data across cells, where appropriate, using the KO- 
LAPS option. 

Each program uses two input and two output BCD tapes,
unlabeled and unblocked. One input tape specifies sample
information; the other specifies population data. These are in variable
format. One output tape is a system tape on which is written the
run parameters, variable formats, population and sample cell counts,
results of the collapsing option (if used), programmed diagnostics,
and running time.

The main output is an unblocked, unlabeled BCD tape for offline
listing, punching, or subsequent usage as an input tape. Output
records are in fixed format in GENWTS, but in variable format in
INSTWTS. If the collapsing option has been used, the output
records show the final receiving cell number and the final set of cell
weights. The output records contain institutional identification
numbers and 100 times the weights.

For either program the internal execution time on the CDC 3600 
is about one and one-half minutes with little or no collapsing 
operations. Extensive collapsing operations require an additional 
half minute. These estimates are based on execution runs with 
populations of about 2,000 institutions and about 400 institutions 
in the sample. 

Extensive use of “comment” cards in both source decks facilitates 
reading of the programs and preparing execution runs. Cell collapse 
requirements are specified in the parameter card; multiple collaps- 
ing of cells (e.g., cell 3 into cell 2, and then into cell 1 to combine 
all three cells) is possible. 

Features unique to the two programs. In addition to the fact that
Program INSTWTS provides only the type IV weight, the
programs have other minor differences in detail, as described in the
program documentation. One major difference is that Program
INSTWTS permits the user to input a data record for each
institution and output that record with the computed institutional weight
and cell number added to the original record. This permits the
direct use of the output record in subsequent data processing.
Although this option is not available in GENWTS, the user having
to merge the set of weights with his data files, Program GENWTS
has certain other available options, which are suitable for
processing data from within-institution record units. For example, a
constant set of weights may be developed for application to a variable
number (1-9) of subgroups within institutions and for the total
sample. In addition GENWTS provides an option for
computing differential weights for two subgroups of
within-institutional sampling units (e.g., male/female, full-time/part-time, or
tenure/nontenure).

Availability. Copies of the source program deck, and detailed
program documentation may be obtained for nominal cost of
reproduction and mailing. A population input tape giving
institutional characteristics is also available for those planning surveys
in higher education. Requests should be addressed to the author,
Office of Research, American Council on Education, 1 Dupont Circle,
Washington, D. C. 20036.
A FLEXIBLE PROGRAM FOR KEYING AND
STANDARDIZING SCALES DEVELOPED FROM THE
ITEM POOL OF THE MMPI


RHODES C. YOUNG AND GORDON ADAMS
University of California, San Francisco 


The proliferation of theoretically interesting new scales for the
Minnesota Multiphasic Personality Inventory (Block, 1965; Tryon,
1966; Wiggins, 1966) has made computer scoring mandatory
for the use of the instrument in any training or research center.
Hand-scoring cannot efficiently provide the important available
alternatives of item keying. The following program has been
developed in response to the diverse needs of psychologists at the
Langley Porter Neuropsychiatric Institute on the San Francisco
campus of the University of California. The program deck, LPMMPI,
written in FORTRAN IV for the IBM 360/50, is available
from the Computer Center, U. C. Medical Center, San Francisco,
California, 94129.

Output options. The program output options are: (a) printout
of the standard profile in duplicate and of a listing of as many as
00 extra scales with T-score transformations, headed by 31 digits
of subject-identifying data read from the test answer sheet; (b)
any of the above information punched onto cards in addition to
the 566 item responses whether blank, true, false, or multiply
marked; (c) printout and/or card punch of the Goldberg psychotic
indices (1965) and a runs test of the response sequence to gauge
attentiveness of the subject to item content; and (d) storage of
the above information on magnetic tape or disk with
retrieval by subject number and test date.

Input features. Input is by means of a general purpose
questionnaire form (IBM H94940) that is read by an optical scanner
(IBM 1232). The card punch connected to the optical scanner
produces four column-binary cards per answer sheet. These cards,
along with a deck of control cards, are entered into the computer
through a card reader having the column-binary feature and are
translated into numerical codes by a program SCANCARD. These
codes are stored on magnetic disk for use by the scoring program
which, for efficient use, is also stored on disk. An alternate input
method uses hand-punched response cards with numeric codes for
the response alternatives.

Other characteristics. During the computer processing of the
item responses, scale "templates" are stored in the computer. The
templates can be introduced from cards with the other input, or can
be retrieved from storage on disk. Each item included in a scale is
matched to the corresponding item of a subject's responses, and
counts of matches are kept for later output. T scores are calculated
separately for each sex and are based on means and standard
deviations which accompany the scale templates. In the calculation
of the runs statistic the total number of items answered true and
the number answered false are counted along with the number of
alternations between answering true and answering false. The
program on the IBM 360/50 scores about eight subjects per minute,
following approximately two minutes of set-up time.
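
The two calculations described above can be sketched briefly. This is not taken from the LPMMPI deck; the function names are hypothetical, the T score uses the conventional mean of 50 and standard deviation of 10, blank or multiply marked items are assumed to be skipped, and the z statistic is the standard Wald-Wolfowitz runs test, which may differ from what the program prints.

```python
import math
from typing import Sequence, Tuple


def t_score(raw: float, mean: float, sd: float) -> float:
    """Conventional T score: mean 50, standard deviation 10."""
    return 50 + 10 * (raw - mean) / sd


def run_counts(responses: Sequence[str]) -> Tuple[int, int, int]:
    """Count true answers, false answers, and true/false alternations.
    Any response other than 'T' or 'F' is skipped (an assumption)."""
    answered = [r for r in responses if r in ("T", "F")]
    n_true = answered.count("T")
    n_false = answered.count("F")
    alternations = sum(1 for a, b in zip(answered, answered[1:]) if a != b)
    return n_true, n_false, alternations


def runs_z(n_true: int, n_false: int, alternations: int) -> float:
    """Wald-Wolfowitz z for the observed number of runs (alternations + 1)."""
    n = n_true + n_false
    runs = alternations + 1
    expected = 2 * n_true * n_false / n + 1
    variance = (2 * n_true * n_false * (2 * n_true * n_false - n)
                / (n ** 2 * (n - 1)))
    return (runs - expected) / math.sqrt(variance)


counts = run_counts(list("TTFF TF T"))   # blanks stand for omitted items
z = runs_z(*counts)
t = t_score(raw=30, mean=20, sd=5)
```

A strongly negative z (too few runs) or strongly positive z (too many) would suggest patterned responding rather than attention to item content.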
COMPUTER PROCEDURE FOR THE ANALYSIS OF
VARIANCE IN A THREE-FACTOR EXPERIMENT
WITH REPEATED MEASURES IN ONE OF THE
THREE FACTORS


YOUNG B. LEE, ROBERT SMITH, AND WILLIAM B. MICHAEL
University of Southern California


À three-factor experimental design with repeated measures in 

one of the factors has been demonstrated to be quite useful in 
Psychological and educational research. Investigations by Meyer 
and Noble (1958) and Lee (1968) were experiments which utilized 
this design, 

The purpose of this paper is to detail the computer program for 
the design. The program is written in FORTRAN IV language for 
processing by a Honeywell 800/400 computer.

Procedure. The program deck must have these control cards in
order to implement the analysis: 

(1) Problem card. Four parameters are punched in the 4I3 format:
    the levels of Factor A and Factor B, the number of
    repeated measures in Factor C, and the number of replications,
    assuming each cell has an equal sample size. A title
    for the experiment may be punched in alphanumeric characters
    in columns 13 to 76 inclusive.
(2) Format card. The F-type variable format card, which describes
    the input data, may be punched anywhere on the
    card. The format card should show the designation (nF3.0),
    where n is the number of repeated measures in Factor C.
(3) Data input deck. Score data obtained for Factor C are
    prepared in accordance with the specified format (Card 2).
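
The card layout above can be made concrete with a short Python sketch of how such cards might be read. It is only an illustration, not part of the FORTRAN IV program: the function names are hypothetical, blank integer fields are assumed to read as zero, and the data fields are assumed to start in column 1 with no skipped columns, which the format card could in practice override.

```python
def parse_problem_card(card):
    """Split a problem card into four I3 fields and a title.

    Assumes the 4I3 layout: four right-justified integers of width 3
    in columns 1-12, followed by a title in columns 13-76.
    """
    card = card.ljust(80)
    fields = [card[i:i + 3] for i in range(0, 12, 3)]
    # Blank fields are read as zero (an assumption of this sketch).
    levels_a, levels_b, measures_c, replications = (
        int(f.strip() or 0) for f in fields
    )
    title = card[12:76].strip()  # columns 13 to 76 inclusive
    return levels_a, levels_b, measures_c, replications, title


def parse_data_card(card, n_measures):
    """Read n_measures scores punched as consecutive F3.0 fields."""
    card = card.ljust(3 * n_measures)
    return [
        float(card[i:i + 3].strip() or 0)
        for i in range(0, 3 * n_measures, 3)
    ]


if __name__ == "__main__":
    print(parse_problem_card("  2  3  4 10LEARNING EXPERIMENT"))
    print(parse_data_card(" 12  7 15  9", 4))
```

Under these assumptions the first call returns two levels of Factor A, three of Factor B, four repeated measures, ten replications, and the title; the second returns the four scores of one subject.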
The program provides an analysis-of-variance table including 
sums of squares, degrees of freedom, mean squares, and appropriate
F-ratios. This table is a replication of one shown in Winer's (1962,
p. 341) book.

The title of the experiment will be printed at the top of the table. 
Means, standard deviations, standard errors of the mean for each 
cell, and correlation matrices for the repeated factors are included 
in the output. Any number of jobs can be run sequentially. Each 
new job is signaled by the advent of a new set of control cards, 
beginning with the problem card (Card 1) and continuing through 
the data input deck. The program is terminated by the "*Endfile"
card at the end of the last job. "*Endfile" is punched in columns
1 to 8.

The program provides a maximum of 25 levels each for Factors 
A, B, and C. Moreover, specifying a numeric 1 for Factor B on 
the problem card will reduce the program output to a two-factor 
experimental design with repeated measures in the second factor. 

PROGRAM FOR “SORTING SHEEP FROM GOATS
WITH ONE YARDSTICK” 


ELAINE L. WALKER 


Sydney Technical College 
Australia 


When deciding which test to use for prediction of a given criterion,
psychologists have traditionally selected tests with the
highest validity. Frequently in applied psychology, a test is used
to classify individuals for the purpose of deciding on a course of
action, e.g., to employ or reject, to admit to a course or not to admit.
Tests, then, need to discriminate between groups but rarely,
even in tests with high predictive validity, do they discriminate
accurately throughout the whole range of scores.

Gregson (1964) has set out an algorism for a test efficiency index 
based on an application of information theory and decision theory. 
The test efficiency index provides a measure of the uncertainty
present when the test is used to discriminate between groups and
indicates the chances, a priori, of having to use a second test to
classify individuals efficiently into groups.

Tests which discriminate between groups efficiently on an overall
basis may yet fail to classify individuals efficiently in certain class
intervals; e.g., tests often discriminate well in the upper and lower
ranges of scores but not very well in the middle range. Gregson
(1964) has provided for this possibility in his algorism. Efficiency
scores and likelihood ratios are calculated separately for each
class interval. When the test is efficient in that class interval and the
likelihood ratio for that class interval exceeds a value, Z, then
the individual is classified as belonging to a specified group.

The program is written for use in an educational setting to
classify potential passes or failures in final examinations on the
basis of a predictor test given at the commencement of a course.
The value of Z is calculated from the pass rate in the final examination.

This program can be used for classification of individuals into 
any dichotomy by equating one category to pass and the other to 
fail. 

The program is written in FORTRAN IV for an IBM 1401 
computer with 16 K storage. 


INPUT 


First and Second Data Cards: Heading. (May be blank.)

Third Data Card: Columns 1-3, number of class intervals, which
must be 10 or fewer.

Fourth Data Card: Lower limits of class intervals, beginning
with the class interval with the highest predictor test score.
Format 10A4.

Fifth Data Card: Upper limits of class intervals, beginning with
the class interval with the highest predictor test score.
Format 10A4.

Sixth Data Card: Frequency count of scores on the predictor
test, in each class interval, for those who passed in the criterion
examination, beginning with the class interval with the highest
predictor test scores. Format 10F4.0.

Seventh Data Card: Frequency count of scores on the predictor
test for those who failed in the criterion examination, beginning
with the class interval with the highest predictor test scores.
Format 10F4.0.

Eighth Data Card: Columns 3 and 4, percentage passing rounded
to whole numbers.
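
The frequency counts on the sixth and seventh cards are enough to form a likelihood ratio for each class interval, and the following Python sketch illustrates that one step. It does not reproduce Gregson's efficiency index or the rule by which Z is derived from the pass rate, neither of which is given here: the function names are hypothetical, `z` and the per-interval `efficient` flags are supplied by the caller, and treating "pass" as the specified group is an assumption.

```python
def likelihood_ratios(pass_counts, fail_counts):
    """Return P(interval | pass) / P(interval | fail) per class interval."""
    total_pass = sum(pass_counts)
    total_fail = sum(fail_counts)
    if total_pass == 0 or total_fail == 0:
        raise ValueError("both groups need at least one score")
    ratios = []
    for p, f in zip(pass_counts, fail_counts):
        if f == 0:
            # No failures in this interval: the ratio is unbounded,
            # or undefined when the interval is empty for both groups.
            ratios.append(float("inf") if p > 0 else float("nan"))
        else:
            # (p / total_pass) / (f / total_fail), cross-multiplied.
            ratios.append((p * total_fail) / (f * total_pass))
    return ratios


def classify_intervals(pass_counts, fail_counts, z, efficient):
    """Label an interval "pass" when it is flagged efficient and its
    likelihood ratio exceeds z; otherwise leave it unclassified.

    `efficient` is a list of booleans standing in for the efficiency
    test, which this sketch does not compute.
    """
    ratios = likelihood_ratios(pass_counts, fail_counts)
    return [
        "pass" if ok and r > z else "unclassified"
        for r, ok in zip(ratios, efficient)
    ]


if __name__ == "__main__":
    passed = [30, 20, 10]   # highest class interval first
    failed = [5, 15, 40]
    print(likelihood_ratios(passed, failed))
    print(classify_intervals(passed, failed, z=2.0,
                             efficient=[True, True, True]))
```

In this made-up example only the highest class interval has a ratio above the chosen threshold, so individuals in the other two intervals would be candidates for a second test.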

A listing of this program will be supplied on request to Miss E. 
Walker, Sydney Technical College, Sydney, New South Wales, 
Australia. 

Joel Allison, Sidney J. Blatt, and Carl N. Zimet. The Interpretation
of Psychological Tests. New York: Harper & Row, 1968.
Pp. x + 342. $8.75. 


This volume represents a restatement of faith in the approach 
to psychological test interpretation expounded a generation ago by 
Rapaport, Gill, and Schafer (1945). As an introduction to clinical 
testing from the viewpoint of psychoanalytic ego-psychology, the 
holistic, impressionistic genre of the Rapaport et al. volume is ex- 
tended in an intensive analysis of a single case—a bright, young, 
hospitalized woman labeled Mrs. T. The analysis of Mrs. T's re- 
sponses to the WAIS, TAT, and Rorschach considers test scores, 
response content, and behavior during testing, relying heavily on 
the tester’s impressions. 

The book is aimed primarily at graduates and advanced under- 
graduate psychology majors and psychiatric residents, and it be- 
gins with an orientation to the holistic, psychoanalytic method of 
test interpretation. The guiding principle is that evaluation should 
specify the processes and organizing principles in personality, and 
tests should answer questions concerned with the examinee’s reality 
testing, impulse control, defenses, areas of conflict, and adaptability. 
The holistic orientation is witnessed by the statement that test 
scores, content, attitudes, and transactions during testing should be
combined in evaluating test results. After a few critical remarks in
regard to the current tendency to emphasize the scientific aspect of
testing, the introduction closes with a brief description of the history
of testing from Cattell to Rapaport. In contrast to certain contemporary
psychologists, these authors are optimistic about the usefulness
of tests in clinical settings, but they maintain that it is only
in such settings that one can learn to use tests effectively.

Chapters 2, 3, and 4 describe the test content, scoring, and the 
process of interpreting scores on the WAIS, TAT, and Rorschach,
respectively. Mrs. T's test results are presented and analyzed at
length, in fascinating if questionable depth. The authors’ differential
reliance on the results of the three tests is made apparent by
comparing chapter lengths: 69 pages for the WAIS, 46 pages for
the TAT, and 126 pages for the Rorschach.

Chapter 5, a nine-page summary of Mrs. T's test results, is replete
with psychoanalytic-type conclusions, which are interesting if
difficult to verify. This is the well-known problem of validating
psychoanalytic interpretations of test results not couched in terms
of behavioral descriptions and concrete predictions. To the be- 
liever in psychoanalytic theory, the conclusions which he draws are 
valid if they are consistent with the theory. Such logic might be 
somewhat more palatable if psychoanalytic theory were substan- 
tiated in some way. The analyst may counter, with Freud, that the 
proof of the theory lies in clinical observation. But different ob- 
servers observe different things, and of course different psychoana- 
lysts often interpret the same test results differently. Only if the 
theory were accepted as a veridical description of personality dy- 
namics and test results were found to be consistent with theory 
would one be justified in entertaining, if not embracing, the ap- 
proach enunciated in this book. However, without some anchor— 
observational or psychometric—of acceptable validity to which 
the test results can be referred and related, it is difficult to be 
enthusiastic.

To continue, Chapter 6 of the volume gives the results of re- 
testing Mrs. T two-and-one-half years after the first testing. There 
are certain differences in the results of the two testings, which the 
authors interpret as real personality changes, although some of 
the differences may be due to test unreliability. The final chapter 
presents a diary which Mrs. T kept for several months after being
admitted to the hospital. Also discussed are the results of a
post-testing interview.

It may be good that all of the psychoanalysts, humanists, and 
other impressionists have not been completely subdued by the pres- 
sures of objectivism and behaviorism, but without some plan for 
checking the validity of the methods elaborated in this book clinical 
psychologists who follow the procedures described by Messrs. Allison,
Blatt, and Zimet will need strong defenses to cope with a
fact-fiction conflict of high valence.

Andrew R. Baggaley. Mathematics for Introductory Statistics, A
Programmed Review. New York: Wiley, 1969. Pp. xiv + 171.
$6.95 and $2.95 (paperback).


Most introductory statistics courses seem to have more than
their share of students who struggle with algebra. Many, if not
most, of these students have taken course work in algebra. For
such students, Baggaley attempts “to construct a condensed review
of the bare minimum of (algebraic) processes” (p. vii). The
brevity and programmed format of the text should encourage a 
structured review by students who would not normally attempt 
any review of mathematics. On the other hand, some valuable 
topics in need of review were omitted. 

The first chapter contains selected topics from a beginning course 
in algebra and would normally be assigned before or during the 
first few weeks of the elementary statistics course. Coordinate 
systems, plotting points, and graphs of linear equations are covered 
in Chapter 2. It would probably be assigned before the statistical 
topics of prediction and correlations. The complete theory for the 
computational method for calculating square roots by hand con- 
stitutes Chapter 3. The availability of tables and mechanical cal- 
culators would allow students to omit this chapter if the instructor 
so desired. The use of summation notation (Chapter 4) is the only
topic that is covered without assuming previous study by the stu- 
dent. Since it would be useful with the topics of the mean and 
standard deviation, the instructor may need to assign Chapter 4 
between Chapters 1 and 2. It seems unfortunate that the author in- 
troduces summation notation without covering the use of the index 
of summation. Such a procedure seems destined to cause problems 
when summation over more than one subscript is eventually
reached. The author does not recommend that the text be completed
in one assignment. However, if necessary, complete reading
over a short period of time would be possible since the total
working time is estimated at just over three hours.

A comparison with competing texts indicates this text has a
different approach, which, depending on one's needs, may be either
good or bad. The coverage is not nearly as comprehensive as
Helen M. Walker’s classic, Mathematics Essential for Elementary
Statistics, which contains nearly all requisite topics from algebra
and related areas. Some of the topics covered in Walker's text but
not in this text are as follows: per cents, factorials, inequalities,
ratios and trigonometric functions, linear interpolation,
logarithms, permutations and combinations, binomial [...],
general second degree equations in two variables and
their graph, short-cuts in computation, absolute value, and equations
with several variables. Many of the topics common to both texts
are covered in greater detail by Walker. Baggaley makes no attempt
to introduce and integrate statistical concepts or formulas,
even as examples. In Preparation for Basic Statistics, Clark and
Tarter present a mathematics review and introduce such topics as
[...], variance, frequency, frequency table, z-score, histograms,
and scatter diagrams. In Walker's chapter on equations with several
variables she presents a particularly lucid discussion of the concept
of degrees of freedom.

A serious omission in Baggaley’s algebra review is the topic of
laws of exponents. These laws are unavoidably used in the text 
without any review and are certainly a topic most students fail to 
remember. Similarly, Baggaley does not explain “type products,”
e.g., (x + y)², but in the text he refers to "the cross-product term"
when expanding a binomial. At least the square of a binomial 
should be developed. Without some additional explanation for the 
weaker student, the use of powers and type products may tend to 
be confusing, thus destroying an otherwise straightforward development.
At the risk of increasing the length of the text, addition
of such topics as inequalities, absolute values, per cents, factorials,
and the use of tables would increase the amount of usable information
for beginning statistics students. To obtain space for the inclusion
of these topics, the author could omit the theory on calculating
square roots, thus shortening Chapter 3 to coverage of
some combination of (a) use of square root tables, (b) approximat- 
ing square roots and/or (c) the computational algorithm for cal- 
culating square roots (without the underlying theory). Omitting 
the sixteen pages of graph paper appended to the text would help 
conserve space, if not working time. 

The programmed format of the text should facilitate individual 
use by the students and save instructor’s time. The author includes 
adequate instruction for the reader who is not acquainted with 
programmed instruction. A masking shield is provided, and hope- 
fully this will encourage students to use the text correctly instead 
of just reading it. Objectives are stated before each chapter and a 
review test is included in the middle of Chapter 1 and at the end
of each chapter. Answers to these tests are given in the back of the 
book. If individual frames had been numbered, the answers could 
have referred to appropriate frames for review of topics students 
failed to learn. 

The verbal summaries at the end of Chapters 1 and 3 would be 
more effective if they were given at the end of each chapter in
outline form, including the numbers of the individual frames covering
the appropriate concepts. Enlarging the index would also allow
a student to reference particular ideas he has forgotten. A student
would probably make little subsequent reference use of the book
in its present form after he has once worked through it.

The shortcomings of Mathematics for Introductory Statistics are
incomplete coverage and the shortcomings of programmed
books generally. The programmed format for a textbook may be
good pedagogy, but it yields a notoriously inadequate reference
work. We believe that students in an elementary statistics course
will need to refer frequently to a mathematics review. Baggaley’s
programmed format and inadequate index would significantly hamper
such referencing. Measured against Walker’s 35-year-old Mathematics
Essential for Elementary Statistics, Baggaley's text comes
out second best.

Oscar K. Buros (Ed.). Reading Tests and Reviews. Highland Park,
New Jersey: Gryphon Press, 1968. Pp. xiv + 520. $15.00.


Reading Tests and Reviews is the first of a new series of monographs
(to use Professor Buros’ modest description of this 520-page,
hard-cover publication). First projected in 1940, the series
will comprise four volumes, each listing tests and presenting test
reviews in a specialized area. The second volume in the series,
promised for 1969, will cover personality tests.
This volume includes: (1) a bibliography of reading tests known
to be in print in English-speaking countries as of May 1, 1968 or
known now to be out of print, though once listed and perhaps reviewed
in one of the six Mental Measurements Yearbooks; (2)
a reprinting of the reading sections of the six Yearbooks; (3) a
[...] and annotated listing of the 2802 tests listed in them;
(4) a directory of publishers of reading tests; (5) an index of titles
of all tests listed in the six Yearbooks; and (6) an index of
names mentioned in them.
The outstanding contribution of this monograph is that it brings
together in convenient form the enormous amount of bibliographical
material about reading tests of the past and present that has
been available heretofore only by consulting separately the six
Mental Measurements Yearbooks. As has been pointed out many
times, the quality of the reviews themselves varies greatly from
reviewer to reviewer and slightly from test to test by the same
reviewer. Professor Buros has done remarkably well in choosing
reviewers and in refereeing differences among publishers, authors,
and reviewers. In doing this, he has always maintained the
highest standards of integrity and editorial courage.

Comparison and evaluation of several reviews of the
same test by different reviewers points up the unevenness in the
quality of the reviews that one merely senses as he has occasion to
use the monograph in the normal course of academic duties. Consider,
for example, entry 1099, which covers the Durrell-Sullivan
Reading Capacity and Reading Achievement Tests, first published 
in 1937, and reviewed by four different individuals. These review- 
ers agreed that research beyond that underlying the development 
of the tests is needed to determine the validity for predicting 
reading disability of the difference scores obtained by giving both 
tests. One reviewer specifically questioned use of a measure of 
understanding of verbal concepts as a measure of capacity to read 
with comprehension. She would prefer to use a measure of understanding
nonverbal concepts for this purpose. She wrote (p. 27),
“Many poor readers do well on performance-type intelligence 
tests, but less well on verbal intelligence tests such as the Binet. 
. . . Such children might fail to do well on both the Reading
Capacity Test and the Reading Achievement Test because of a
disability in a verbal function common to both tests." 

This material helps to perpetuate the fallacy that performance 
in forming nonverbal concepts indicates the level of performance 
to be expected in forming verbal concepts if the mechanics of read- 
ing can be mastered. Research on mental abilities indicated by the 
middle 1930's that some individuals have high capacity for dealing 
with nonverbal concepts and low capacity for dealing with verbal 
concepts, and vice versa. Consequently, in diagnosing reading dis- 
ability it is advisable to use the procedure adopted by Durrell and 
Sullivan of comparing performance on verbal materials approached 
through reading and independently through hearing stories and 
looking at pictures. However, recent studies (Singer, 1965) indicate 
that further research is needed on this point. 

A basic weakness in the Durrell-Sullivan materials is not mentioned
in any of the four reviews. It lies in the fact that differences
between scores on their “capacity” and “achievement” tests are 
not properly interpreted if the directions in the test manual are
followed. These directions apparently use the standard error of a
difference, instead of the standard error of measurement of a
difference, for judging the significance of a difference between a
pupil’s obtained scores on the “capacity” and “achievement” tests. 
If the directions in the manual are followed, too few differences are 
identified as significant at any specified level. Unfortunately, the 
diagnostic usefulness of the tests has been understated for thirty 
years because the publisher failed to have the procedures evaluated
by a psychometrician. The point at issue was explained by T. L.
Kelley in 1927 in Interpretation of Educational Measurements, 
published by the same company as the Durrell-Sullivan tests. 

The set of four reviews of these tests illustrates the strengths 
and weaknesses of Reading Tests and Reviews. Along with valuable
bibliographical data and helpful descriptive information,
[...] a substantial amount of misinformation and no mention
of an important weakness in the test materials.

In conclusion, it should be said that the design, typography, and
[...] of the book are first rate; for this, we must salute Oscar [...]

Allen L. Edwards. Statistical Analysis. (3rd ed.) New York: Holt,
Rinehart and Winston, 1969. Pp. xi + 244. $7.95.


The third edition of Edwards' Statistical Analysis, a book which
[...] nearly a quarter of a century, repeats both the appeal
and the lack thereof of the earlier editions. Since comparisons are
generally considered odious, an evaluation of the volume in its own
right will be attempted, and lists of differences between this and
[...] avoided.

On the positive side, the book is eminently well-written and the
[...] presented in such fashion as should appeal to the most
[...]. The familiar lucidity of Edwards’ prose has never been
[...]. There is a decided Linus’-blanket effect that would
[...] most panicky of students approach the subject with
[...] of finding some kind of security among the explanations
[...]. A serious and successful attempt has been made by
Edwards to render the fundamentals of statistics palatable.

[...] ways, Edwards’ approach has become highly appropriate
[...]. There has been an interesting evolution in the
[...] statistics. Perhaps only ten years ago, topics beyond
[...] elementary analysis of variance models were reserved for
[...] courses in statistics. Ten years ago, Statistical Analysis
was an excellent introductory text. Then as the content of statistics
[...] an elementary and intermediate level became more
[...], the volume must have lost some of its appropriateness.
[...] universal accessibility to electronic computers has
[...]. However, the style and content of Statistical Analysis,
[...] in its most recent edition, is worthy of consideration
[...] instructors.

[...] number of students in psychology, and perhaps more so
[...], the use to which inferential statistical methods are
[...]. Without making a value statement of this latter
observation, it is the case that a not insignificant proportion of
students utilize little more than the most elementary descriptive 
statistical techniques. These students rely on specialists to advise 
them in carrying out a statistical analysis and will perhaps venture 
to the computer centre for design and computational aid. What 
they believe they need from a statistics course is a moderate 
grasp of a few concepts to help them interpret whatever data fall 
their way, and a suitable vocabulary to enable them to communicate
with their consultants. They care little or nothing about derivations
or innovations.

For the category of student so characterized, Edwards’ text is 
likely to be most useful. It provides an essentially verbal introduction
to the bread-and-butter notions of statistics, requires almost
no quantitative manipulation of any consequence, spans a
range of topics that is just adequate, and should leave the student
well enough aware of the strengths and shortcomings of common 
statistical techniques to sharpen his interpretations. It is not a book 
for the student who has to do his own design-work or computation, 
nor is it a book for the student who intends to become a measure- 
ment major—too much would have to be undone or redone. But 
for those students for whom a one semester statistics course is 
their sole contact with the subject, the book is a more than rea- 
sonable choice. 

Even so, there are still a few areas in the book that one would 
have preferred to see presented with different emphasis. To allocate 
20 pages to “frequency distributions” as against 12 pages to analy- 
sis of variance seems a little strange. There is a stage of quantitative 
development that one has to assume and with it a valuing of 
what is currently relevant to research. Surely, if students are to be
required to put effort of their own into a course, it is more reasonable 
to expect them to handle such rudimentary topics as graphs by 
themselves than it is to contend with t and F tests. It is in its 
adherence to a somewhat outdated value-system among statistical 
topics that the detractions in the book lie. It may also have been 
useful for the assumptions underlying the various tests to have 
been spelled out more explicitly. In short, the kind of student this 
book is seen as most useful for is going to be disserved in some
senses—he is going to obtain a rather conservative and dated view 
of what is important in statistics, and he is going to be without 
guidelines as to what assumptions he has to meet in applying these 
statistical models. 

There are sixteen chapters spread over the 200 pages, a useful
appendix of the common statistical tables and a full index. On the 
whole, the choice of content is one of the strengths—all of the 
useful elementary indices and distributions are included, plus two
fine, if sketchy, chapters on test theory and sampling. 


There is little to be said about much of the content, for it is
just what one would expect and is quite conventional in approach. 
The first chapter provides an overview and some basic vocabulary; 
chapter 2 rather belabours the notion of scales of measurement 
(the table of contents contains a locational error for this chapter, 
incidentally). As do all chapters, these end with a review of terms, 
symbols, questions, and problems. Answers to the problems are 
given in an appendix, though the problems are generally minimally 
difficult. 

Chapter 3 presents a basic appreciation of frequency distributions.
This is a puzzling chapter, being at once both very simple
and rather confusing. There is no help to the student in locating 
the bars of his histograms, for example (where do the midpoints 
of the “bars” go?); yet some of the points which are made so 
elegantly by means of the excellent diagrams are belaboured in 
the text. This kind of chapter is always bothersome. Granted, the 
author wishes to appeal to an essentially nonmathematical audience.
But surely most students who arrive at a university will be
familiar enough with the essence of graph construction to warrant
less emphasis on this topic? Especially since the frequency distributions
of greatest interest in statistics are not even alluded to?

Chapters 4 and 5 are fairly standard explications of the common 
measures of central tendency and variability. The chapter on
central tendency is clear and to the point. The chapter on variability
tends to be a listing rather than an explanation. The variance
([...] values only) is blithely defined without any indication as
to why this particular number reflects variation or why we should
be dividing by (n-1). Now again, some students probably don't
[...] wonder why, but these kinds of skitterings are precisely why
the book is less appealing for the more serious-minded. Some of
its shortcomings are alleviated as the chapter proceeds, but it
does take an astute student and an alert instructor to piece the
[...] into anything like whole cloth. Quartile deviations are,
interestingly enough, completely ignored.

Chapter 6, on product-moment correlation, is rather well done.
[...] all the basics and the understanding necessary for
[...] correlation and regression at an elementary level. This,
and the following four chapters on elementary probability theory
and probability-distributions are well done in selecting the critical
points from the theory without raising misconceptions. The fact
that the [...] is simple is not of itself a detraction when it is
[...] in this fashion. One might have a few questions; for
example, why raise the issue of conditional probability at all if
[...] afford to spend only a page on it? But these chapters are
[...] straightforward, good sense.

The next group of chapters deal with t, Analysis of Variance and
Chi Square. Chapter 11, on t, is satisfying enough, but chapters
12 and 13 are barely adequate. Both these latter chapters resort 
to a “this-is-it, gang” style, and leave more unanswered than
answered, at least if anyone is interested enough to read other
than superficially. Again, we see the enigma of the book: it could 
be frustrating to the better student while being eminently suited 
to the less interested. 

The final three chapters again return to elegance and simplicity. 
The content covers indices of association other than r, the theory 
of error of measurement, and some sampling theory. Edwards'
writing is at its best when he is concerned with correlational 
techniques and these chapters are no exception. They are all ade- 
quate, essential and clear. 

The book, then, is on the whole very suitable for the below-average
and terminal student who is at the same time aware of his
own limitations. The notions and vocabulary to be gained from 
the book could be rarely faulted, but they really are no more than 
notions at a very elementary level. If this is an instructor's purpose
he would find a highly attractive presentation in the new edition
of Edwards' book.


PETER A. TAYLOR
The University of Manitoba 


Richard I. Lanyon. A Handbook of MMPI Group Profiles. Minneapolis:
University of Minnesota Press, 1968. Pp. vii + 79.


The author, Richard I. Lanyon, is an Associate Professor of 
Psychology at the University of Pittsburgh who received the Ph.D. 
degree in clinical psychology from the University of Iowa. The 
book is a collection of group profiles on the MMPI. The subjects 
represent various groups classified according to diagnostic or
behavioral traits. Group profiles are presented for subjects diagnosed
as having psychotic disorders, personality disorders, miscellaneous
psychiatric disorders, psychophysiological and physical
disorders, and brain disorders. Other major categories include
adolescents, parents of disturbed children, prisoners, student and
occupational groups, and racial and cultural groups. Group profiles
for subjects from a number of miscellaneous categories are
also included. These groups are characterized by being aged, homosexual,
deaf, pregnant, cerebral palsy victims, or stutterers. Other
[...] from groups representing varied conditions [...].

This handbook reports 297 mean MMPI profiles with relevant
information about the background of each group and the manner
of selection and testing. It is designed to serve as a source book of
basic data for both research and clinical purposes. According to
the author, the group profiles are not particularly well suited to be 
of direct aid in the interpretation of individual profiles; they are 
better regarded as a general source of information about the re- 
lationship of the MMPI to the diagnostic and behavioral cate- 
gories represented. Thus, the most useful information for improy- 
ing diagnostic skill can probably be gained by noting differences 
among the mean profiles of different diagnostic groups rather than 
by attempting to find the group providing the best fit for an in- 
dividual profile. According to the author, the handbook can be used 
as a source of research data in two ways: as a source of test data 
about different groups of people or as a source of validity data 
about the MMPI. 

The author devotes most of the introduction to a description of 
the manner in which the MMPI scales were constructed and to 
standard interpretations of the scales. He discusses demographic 
and other variables as related to the inventory. Dr. Lanyon points 
out the importance of considering base rates in the interpretation 
of profiles. 

The profiles in the volume are separated into thirteen categories.
The headings of the first categories follow the American Psychiatric
Association's diagnostic and statistical manual, while the remainder
were chosen for convenience of reference. The following
guidelines were employed in selecting the groups reported: (a) each
group possessed some noteworthy behavioral or biographical
characteristic defined independently of the MMPI itself; (b) at least a
minimal amount of descriptive information was available about
each group; and (c) groups showing few or no distinctive MMPI
characteristics were included in order to indicate behaviors that
are not reflected in the MMPI. There must be hundreds of groups
that could have been presented in this handbook. One wonders
what additional criteria were used in selecting the 297 groups for
inclusion in the book.
The profiles presented give one a quick visual impression of
the characteristics of the given group. It would have been
helpful to have means and standard deviations for the
profiles; with these data one could statistically compound groups
and thereby might learn additional information. Combining groups
would give one a profile with greater stability because of the larger
number of persons included and probably increase the generality
of the results. Such is of importance, for the groups represented in
the handbook are haphazard samples from ill-defined populations.

The handbook will be a useful reference for students and
professionals who are caught up in the mystique of the MMPI. It might
be a source of hypotheses for persons engaged in research on
personality. However, serious investigators will want to go to the
original research reports from which the handbook data were
drawn.

The reviewer found the profiles stimulated his thinking and
recommends the handbook as a source of hypotheses and a list of
references to MMPI data on some relatively well identified subject
groups.


WILBUR L. LAYTON
Iowa State University


Paul McReynolds (Ed.). Advances in Psychological Assessment, 
Vol. 1. Palo Alto, Calif.: Science and Behavior Books, 1968. 
Pp. 336. $9.50. 


Advances in Psychological Assessment is the first volume of a
series, the purposes of which are the description and evaluation of 
new developments in the technology of assessment, the presentation 
of innovations in assessment theory and methodology, and the 
summarizing of important topics in assessment. The 14 chapters 
of this volume, some of which deal with a single instrument and 
others with a multitude of psychometric devices, are: 

1. “An Introduction to Psychological Assessment” by Paul
McReynolds: A brief overview of the current scene in psychological
assessment and the organization of the book.

2. “Current Conceptions of Intelligence and Their Implications
for Assessment” by Thomas J. Bouchard: A survey of the major
theoretical points of view on intelligence, including factor-analytic
(Guilford, Vernon, Cattell), Guttman's facet theory, Hayes'
experience-producing drives (EPDs), computer simulation, and
Piagetian theory. Bouchard points out that current efforts in this area
are concerned with describing the structure of cognition and thought,
and with specifying the biological and environmental conditions that
control its growth and development. The assessment of intelligence
is viewed as a matter of sampling behavior to obtain information
on schemata, operations, and concepts.

3. “Assessment in the Study of Creativity” by H. Edward Tryk:
Summarizes the literature on assessing creativity, including
creativity as a product, a process, a capacity, and as encompassing
the whole person. The writer gives extensive descriptions of
Mednick's Remote Associates Test, Guilford's tests of creativity,
Torrance's Tests of Creative Thinking, and the Barron-Welsh Art
Scale. He recognizes that instruments for measuring creativity are
generally unsatisfactory, particularly because of the difficulty of
isolating and quantifying meaningful criteria and the failure to
pay more attention to motivational factors in validating tests of
creativity.

4. “An Interpreter's Syllabus for the California Psychological
Inventory” by Harrison G. Gough: Describes the 18 scales of the
CPI and how they were constructed, and gives adjectival
descriptions of high and low, male and female scores. Gough also includes
a brief discussion of interactions among the scales and scale score
profiles.

5. “The TSC Scales: The Outcome of a Cluster Analysis of the
550 MMPI Items” by Kenneth B. Stein: This study represents the
first attempt to apply a factor-analytic approach to the entire
group of MMPI items. Seven item clusters were found: social
introversion, body complaints, suspicion and mistrust, depression,
resentment, autism, and tension.

6. “The Strong Vocational Interest Blank: 1927-1967” by David
P. Campbell: A brief but thoroughly interesting history of the
SVIB from the time of its introduction in the 1920's to the present
day.

7. “Current Status of the Rorschach Test” by Walter G. Klopfer:
Describes the current status of the Rorschach, not as a marginal
instrument for probing the depths, but as a set of stimuli concerning
which there is much research evidence.

8. “Assessment of Psychodynamic Variables by the Blacky
Pictures” by Gerald S. Blum: The construction and theory of
the Blacky Pictures are discussed, together with a review of research
and special-purpose uses of the instrument.

9. “Operant Conditioning Techniques in Psychological
Assessment” by Robert L. Weiss: Deals with the applications of operant
conditioning methods to the assessment of individual differences.

10. “Assessing Change in Hospitalized Psychiatric Patients” by
C. James Klett: Scales for rating psychiatric observations and
inventories of ward behavior are described. Klett maintains that
in recent years such devices have become less subjective and based
more on objectively controlled experiments.

11. “The Assessment of Counseling and Psychotherapy: Some
Problems and Trends” by George A. Muench: Details some problems
and trends in the assessment of psychotherapy, viz., the problems
in defining outcome criteria, of determining adequate criterion
measures, of controls, of the therapist's interview behavior, and of
the client's unique individuality. Also dealt with are current trends
in assessing (1) the outcomes of counseling and psychotherapy, (2)
the treatability of clients, (3) the personality of the therapist, (4)
the relationship of client and therapist, and (5) the duration of
treatment.

12. “Conjoint Family Assessment: An Evolving Field” by
Arthur M. Bodin: Deals with the assessment of the relationships
among family members and the results of conjoint family therapy.

13. “The Assessment of Anxiety: A Survey of Available
Techniques” by Paul McReynolds: The editor of the volume presents a
detailed, enlightening review of various psychometric me[...]
anxiety. He also makes a number of suggestions for future [...]
on the topic; for example, the development of stimulus-n[...]
inventories of anxiousness.

14. “Psychophysiological Assessment: Rationale and Problems”
by James R. Averill and Edward M. Opton, Jr.: The authors of
the concluding chapter present a rare, stimulating discussion of
episodic and dispositional variables, emotional reactions, and
personality traits in psychophysiological assessment. They point out
that psychophysiological measures are not necessarily more
“[...]mental” than behavioral measures, but neither are they as
[...] to employ as some imagine.

From this potpourri of papers, anyone concerned with psychological
measurement should be able to find at least a few of interest
to him. Especially worthwhile reading, from the reviewer's viewpoint,
are the presentations in Chapters 2, 3, 13, and 14. But all
the papers are of substance, and the editor has done a commendable
job in this first volume. Let us hope that all of the light has
not been spent and that future volumes in the series follow suit.


TRAIT AND EVALUATIVE CONSISTENCY
IN SELF-DESCRIPTION¹


ALLEN L. EDWARDS 
University of Washington 


¹ This research was supported in part by Research Grant 2 R01
MH04075-08 from the National Institute of Mental Health, United
States Public Health Service.

The social desirability scale values (SDSVs) of personality
statements can be estimated by having a group of judges rate the
statements on a 9-point scale in accordance with standard instructions
described by Edwards (1957). On the 9-point rating scale, 1
represents extremely undesirable statements, 5 represents neutral
statements, and 9 represents extremely desirable statements. The average
rating assigned to a statement is the SDSV of the statement.

If subjects are asked to describe themselves in terms of a set of
personality statements, the percentage answering True, P(T), to
each statement can be obtained. For random or representative sets
of statements, P(T) is a linear increasing function of the SDSVs of
the statements. The evidence regarding the linear relationship
between P(T) and SDSV has been summarized by Edwards (1967),
who reports that the typical product-moment correlation between
these two variables is about .87.

When the SDSV of a personality statement is known, it is possible
to define a socially desirable (SD) response to the statement. An
SD response is defined as a True response to statements with
SDSVs > 5.0 or as a False response to statements with SDSVs
< 5.0. A socially undesirable (SUD) response is, of course, just
the opposite of an SD response. Individual differences in rates of
SD responding can be measured by counting the number of SD
responses that subjects make in answering a given set of personality
statements.
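
The definitions above can be sketched in a few lines of Python. This is only an illustration: the function names and the small set of ratings are hypothetical and do not come from the study, and a statement with an SDSV of exactly 5.0 is treated here as having no SD response, which is an assumption.

```python
from statistics import mean

NEUTRAL = 5.0  # neutral point of the 9-point rating scale


def sdsv(ratings):
    """Social desirability scale value: the mean of the judges' ratings."""
    return mean(ratings)


def is_sd_response(answer, value):
    """Return True if `answer` (True or False) is the SD response.

    True is the SD response when the SDSV is above 5.0, and False is the
    SD response when the SDSV is below 5.0. A statement at exactly 5.0 is
    treated as having no SD response (an assumption of this sketch).
    """
    if value > NEUTRAL:
        return answer is True
    if value < NEUTRAL:
        return answer is False
    return False


def sd_count(answers, values):
    """Number of SD responses one subject gives to a set of statements."""
    return sum(is_sd_response(a, v) for a, v in zip(answers, values))


# Hypothetical ratings from four judges for three statements.
ratings = [[8, 7, 9, 8], [2, 3, 1, 2], [6, 7, 6, 5]]
values = [sdsv(r) for r in ratings]
answers = [True, True, False]  # one subject's self-description
print(sd_count(answers, values))  # only the first answer is an SD response
```
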

In scales designed to measure specific personality traits, the trait 
response to each item in the scale is, by definition, either an SD or 
an SUD response. If all or most of the items in a trait scale are 
keyed for SD or SUD responses, then scores on the scale may be 
highly correlated with other independent measures of rates of SD 
responding. When this is the case, the trait scores may reflect in- 
dividual differences in the trait, individual differences in rates of 
SD responding, or both. In other words, trait scores and rates of 
SD responding are, in this instance, completely confounded. 

In a study by Edwards (1961) scores on 43 MMPI scales,
designed to measure different personality traits, were correlated with
scores on an SD scale designed to measure rate of SD responding.
For each MMPI scale the percentage of trait responses which were
also SD responses was obtained. The signed correlations of the
MMPI scales with the SD scale were found to be linearly related to
the percentage of SD responses in the MMPI scales. The correlation
between these two variables was .92. In another study by Edwards,
Diers, and Walker (1962) 58 MMPI scales and 2 other scales were
intercorrelated and factor analyzed. The signed loadings of the 60
scales on the first principal component were found to correlate .90
with the percentage of items in the scales keyed for SD responses.

In still another study, Edwards and Walsh (1963) subdivided
MMPI scales into two parts: those items keyed for SD responses
and those keyed for SUD responses. To determine the intensity of
the SD and SUD keying of each part of these scales, the SDSVs of
items keyed False were reflected. The sum of the reflected and
unreflected SDSVs was then obtained for each scale and this sum was
divided by the number of items in the scale to obtain an average
index of intensity for each scale. The scales were then
intercorrelated and factor analyzed. The correlation between the signed
loadings of the scales on the first principal component and the
average index of intensity of the scales was .88.
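
A minimal sketch of the average index of intensity follows. The text does not say how the SDSVs of items keyed False were reflected; the sketch assumes reflection about the midpoint of the 9-point scale (reflected value = 1 + 9 - SDSV). The function names and the three-item scale are hypothetical.

```python
def reflect(value, low=1.0, high=9.0):
    """Reflect a scale value about the scale midpoint (assumed rule)."""
    return low + high - value


def intensity_index(items):
    """Average keying intensity of a scale.

    `items` is a list of (sdsv, keyed_true) pairs. SDSVs of items keyed
    False are reflected; the reflected and unreflected values are summed
    and divided by the number of items.
    """
    adjusted = [v if keyed_true else reflect(v) for v, keyed_true in items]
    return sum(adjusted) / len(adjusted)


# Hypothetical three-item scale: two items keyed True, one keyed False.
scale = [(7.0, True), (2.0, False), (6.0, True)]
index = intensity_index(scale)  # (7.0 + 8.0 + 6.0) / 3 = 7.0
print(index)
```

Under this assumption an index above 5.0 indicates a scale keyed mainly in the socially desirable direction, and an index below 5.0 a scale keyed mainly in the undesirable direction.
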

When a scale consists of a single item, the intensity of the SD or
SUD keying of the item is simply the SDSV of the item. For example,
an item with an SDSV of 7.0 keyed True is assumed to represent
a greater intensity of social desirability than an item keyed True for
which the SDSV is 6.0. Similarly, if an item has an SDSV of 2.0 and
is keyed True, it is assumed that this item has a greater intensity of
social undesirability than an item with an SDSV of 3.0 which is also
keyed True.

Let the score on a single item scale be 1 if answered True and 0 if 
answered False. For any given set of items it is possible to obtain 
the intercorrelation matrix of the item responses and to factor 
analyze this matrix. If the results of the Edwards and Walsh study 
are applicable to single item scales, then the loadings of the items on 
the first principal component should be linearly related to the
SDSVs of the items.

The 90 trait terms listed in Table 1 may be considered a set of
90 items to be answered True if a subject believes the term is
descriptive of him and False if he does not. Note, however, that the 90
terms have been arranged in 15 sets of four terms and 10 sets of
three terms. The terms within each set were intended by Peabody
(1967) to represent a common bipolar trait. In the first set, for
example, “cautious” and “timid” were assumed to represent one pole
of a common trait and “bold” and “rash” the opposite pole.
Furthermore, one of the two terms at each pole was intended to be positive
or to have a socially desirable scale value and the other to be
negative or to have a socially undesirable scale value. The deviation
SDSVs listed in Table 1 are those reported by Peabody, who had
the traits rated on a scale ranging from -3 to +3, with 0 defined as
the neutral point.

Within each set of four terms it is possible to calculate two
correlation coefficients between two terms that are opposite in
descriptive similarity and also opposite in evaluative sign. For example, for
the first set, these correlation coefficients would be those between
“cautious-rash” and “bold-timid.” Similarly, each set of three terms
results in one correlation coefficient of the kind described. We shall
refer to these correlation coefficients as Type I coefficients.

For Type I coefficients, the trait consistent responses to the two
terms are completely confounded with SD responses to the terms
as shown in Table 2. For example, if a subject answers True to
“cautious,” then he should answer False to “rash” and both of these
responses are SD responses. Similarly, if a subject answers True to
“rash,” then he should answer False to “cautious” and both of these
responses are SUD responses. Both trait and SD consistencies in
responding should, in this instance, result in negative values for
Type I coefficients.

Within each set of four terms it is also possible to calculate two
correlation coefficients between two terms that are opposite in
descriptive similarity but such that the evaluative signs of the two
terms are either both plus or both minus. For example, in the first
set, these two correlation coefficients would be those between
“cautious-bold” and between “timid-rash.” Another 10 such
correlation coefficients can be calculated, one for each set of three terms.


TABLE 1
Sets of Trait-Terms and Their Evaluative Ratings

Set
no.   Sets of four

      Temperament
 1    +.9          Cautious            +1.1         Bold
      -1.1         Timid               -1.2         Rash
 2    +1.7         Self-Controlled     +1.1         Uninhibited
      -1.4         Inhibited           -.38         Impulsive
 3    +1.3         Serious             +1.5         Gay
      -1.6         Grim                -1.2         Frivolous
 4    +2.0         Alert               +1.8         Relaxed
      -1.1         Tense               -1.7         Lethargic
 5    +.8          Committed           +2.5         Open-Minded
      -2.4         Fanatical           -.8          Noncommittal
 6    +1.3         Steady              +1.6         Flexible
      -2.1         Inflexible          -1.5         Vacillating
 7    +2.0         Modest              +1.3         Confident
      -1.1         Self-Disparaging    -2.0         Conceited

      Social
 8    +.9          Thrifty             +1.8         Generous
      -2.0         Stingy              -.8          Extravagant
 9    +.5          Skeptical           +1.1         Trusting
      -1.4         Distrustful         -1.4         Gullible
10    +1.3         Selective           +2.5         Tolerant
      [illegible]  Choosy              -1.4         Undiscriminating
11    +1.3         Firm                +.9          Lenient
      [illegible]  Severe              -.9          Lax
12    +1.3         Discreet            +1.8         Frank
      -1.2         Secretive           -1.4         Indiscreet
13    +2.0         Individualistic     +1.6         Cooperative
      -2.0         Uncooperative       -1.6         Conforming

      Ideas and ability
14    +.9          Pragmatic           +1.5         Idealistic
      [illegible]  Opportunistic       -1.2         Unrealistic
15    [illegible]  Cultivated          +2.1         Natural
      [illegible]  Artificial          -.7          Naive

      Sets of three

      Temperament
16    +1.8         Thorough
      -1.5         Fussy               -1.4         Careless
17    +1.3         Moral
      -1.6         Self-Righteous      -1.6         Immoral
18    +1.8         Curious
      -1.7         Nosy                -1.5         Uninquisitive

      Social
19    +1.0         Forceful
      -2.0         Domineering         -1.4         Submissive
20    +1.7         Peaceful
      -1.2         Passive             -2.0         Belligerent
21    +1.6         Polite
      -1.5         Ingratiating        -2.2         Rude

      Ideas and ability
22    +2.6         Intelligent
      -.7          Crafty              -1.7         Stupid
23    +1.9         Foresighted
      -1.8         Scheming            -1.5         Short-Sighted
24    +1.6         Meditative
      -.9          Brooding            -1.6         Unmeditative
25    +1.8         Witty
      [illegible]  Sarcastic           -2.1         Humorless

Note. The evaluative ratings are those given by Peabody (1967).
Twenty judges rated the terms on a "favorable-unfavorable" scale and
20 rated the terms on a [illegible] scale. In both cases the rating
scale was from +3 to -3. The ratings are the means for the 40 judges.

We shall refer to correlation coefficients of the kind described as 
Type II coefficients. 

For Type II coefficients, if the primary determiner of responses 
to the paired terms is trait consistency, then the dominant response 
patterns should be TF and FT, as shown in Table 2, and the Type 
II coefficients should be negative. On the other hand, if subjects 
tend to respond to these pairs of traits in terms of either SD or 
SUD consistencies, then the dominant response patterns should be 
TT and FF, as shown in Table 2, and the Type II correlations 
should be positive. 

Within each set of four terms two correlation coefficients can be
calculated between traits that are descriptively similar but which
are opposite in evaluative sign. For example, in the first set these
two correlation coefficients would be those between “cautious-timid”
and between “bold-rash.” Similarly, each set of three terms results
in one such correlation coefficient. We shall refer to correlation
coefficients of the kind described as Type III coefficients.


TABLE 2
Trait Consistent and SD Consistent Patterns of Response to Pairs of
Trait Terms in the First Set and Predicted Signs of the Correlation
Coefficient between the Paired Terms

                  Consistent       Sign    Consistent     Sign    Type
Trait pairs       trait patterns   of r    SD patterns    of r    of r

Cautious-Rash     TF and FT        -       TF and FT      -       I
Bold-Timid        TF and FT        -       TF and FT      -       I
Cautious-Bold     TF and FT        -       TT and FF      +       II
Timid-Rash        TF and FT        -       FF and TT      +       II
Cautious-Timid    TT and FF        +       TF and FT      -       III
Bold-Rash         TT and FF        +       TF and FT      -       III

If subjects give trait consistent responses, then the dominant
response patterns for Type III coefficients should be TT and FF, as
shown in Table 2, and the Type III coefficients should be positive.
But if subjects tend to respond to these pairs of traits in terms of
either SD or SUD consistencies, then the dominant patterns of
response should be TF and FT and, as indicated in Table 2, the
Type III coefficients should be negative.

The 90 trait terms thus provide an interesting set for investigating 
consistencies in “evaluative” or social desirability responding in 
self-description and also consistencies in trait responding. 
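
The logic of the three types of coefficients can be sketched as a small classification routine. This is an illustration only: the pole labels "A" and "B" and the function name are hypothetical, and the evaluative signs are those intended for the first set of terms.

```python
from itertools import combinations

# Each term: (name, descriptive pole, evaluative sign).
# The pole labels "A" and "B" are arbitrary labels used for this sketch.
FIRST_SET = [
    ("cautious", "A", +1),
    ("timid", "A", -1),
    ("bold", "B", +1),
    ("rash", "B", -1),
]


def classify(term1, term2):
    """Return (type, sign of r if trait consistent, sign of r if SD consistent)."""
    _, pole1, eval1 = term1
    _, pole2, eval2 = term2
    same_pole = pole1 == pole2
    same_eval = eval1 == eval2
    if not same_pole and not same_eval:
        kind = "I"
    elif not same_pole and same_eval:
        kind = "II"
    elif same_pole and not same_eval:
        kind = "III"
    else:
        raise ValueError("terms with the same pole and sign do not occur in a set")
    # Trait consistency predicts a positive r only for descriptively similar terms.
    trait_sign = "+" if same_pole else "-"
    # SD consistency predicts a positive r only for terms with the same evaluative sign.
    sd_sign = "+" if same_eval else "-"
    return kind, trait_sign, sd_sign


for a, b in combinations(FIRST_SET, 2):
    print(f"{a[0]}-{b[0]}: {classify(a, b)}")
```

Only Type II and Type III pairs separate the two sources of consistency, since for Type I pairs both predictions have the same sign.
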


Method 


Social desirability ratings of the 90 traits were made by a group
of 88 male and 126 female students on a 9-point rating scale. The
traits were not presented to the judges in the same order in which
they are listed in Tables 1 and 4. Instead, each of the 90 traits was
typed on a 3 x 5 card and the cards were thoroughly and
independently shuffled by two individuals to arrange them in random order.
The traits were then printed in this random order in a booklet
consisting of four pages. The first three pages of the booklet each
contained 25 traits and the last page 15 traits. The judges were told
that if they did not know the meaning of a given trait they should
mark it with an X and not assign a rating to it. Of the 90 traits,
five were marked with an X by 16 or more of the combined group
of 214 judges. These traits are listed in Table 3, which also shows
the percentage of the judging group that did not assign a rating to
them.

SDSVs of the traits were determined separately for male and 
female judges, using only those judges who assigned a rating to the 
trait. The correlation between the male and female SDSVs was
.986. For the male judges, the mean SDSV was 4.87 and the standard
deviation of the SDSVs was 1.48. The corresponding values for the 
female judges were 4.79 and 1.81. 

A simple unweighted mean of the male and female SDSVs was 
obtained for each trait and these are the SDSVs used in the present 
study. The SDSVs for each trait are given in Table 4. The correla- 
tion between these SDSVs and those obtained by Peabody and listed 
in Table 1 is .96. 
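
The scoring rule just described can be sketched as follows. The function names and the ratings are hypothetical; `None` stands for a trait a judge marked with an X.

```python
import math


def group_sdsv(ratings):
    """Mean rating for one trait, using only judges who assigned a rating."""
    rated = [r for r in ratings if r is not None]
    return sum(rated) / len(rated) if rated else math.nan


def combined_sdsv(male_ratings, female_ratings):
    """Simple unweighted mean of the male and female SDSVs for one trait."""
    return (group_sdsv(male_ratings) + group_sdsv(female_ratings)) / 2


# Hypothetical ratings of one trait; None marks an X response.
male = [6, 7, None, 5]
female = [7, 8, 7, None, 6, 8]
value = combined_sdsv(male, female)  # (6.0 + 7.2) / 2 = 6.6
print(round(value, 2))
```

Because the two group means are averaged without weights, the result can differ from the mean of the pooled ratings (6.75 in this example) whenever the groups differ in size.
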

The 90 traits were given to another group of 127 male and 180
female students. These students were asked to describe themselves
in terms of the traits by marking the trait True if they believed it
accurately described them and False if they did not. The same
random order of presentation of the traits used in obtaining social
desirability ratings was used in obtaining self-descriptions. The
subjects were also told that if they did not know the meaning of a given
trait they were to mark it with an X and not to answer it True or
False. The five traits with the largest number of X responses were
the same as those which were most often not rated by judges in
obtaining the SDSVs of the traits. Table 3 shows the percentage of
the combined group of 307 subjects who failed to answer each of
the five traits.

There is no reason to believe that the students who rated the traits
for social desirability were any more knowledgeable about the
meaning of the traits than those who described themselves. Yet, in every
case, the percentage of students not answering a trait in
self-description is considerably larger than the percentage of students not
rating the trait. It should be emphasized that both the ratings and
self-descriptions were done anonymously. The obvious implication
is that more students tend to be cautious when asked to make a
judgment about themselves than when they are asked to make a
judgment that has no self-reference.


TABLE 3
Proportion of the N = 214 Judges Not Rating Five Terms and Proportion
of the N = 307 Subjects Not Answering the Same Five Terms

                      Ratings        Self-Description
Pragmatic             .150           .362
Self-Disparaging      .145           .371
Lethargic             [illegible]    [illegible]
Ingratiating          [illegible]    [illegible]
Vacillating           .136           .401


TABLE 4
SDSVs of 90 Trait Terms and Loadings of the Terms on the
First Principal Component

Trait                                      Evaluative     Factor
 no.   Trait               SDSV            sign           loading
  1    Cautious            5.58            +              15
  2    Bold                6.42            +              02
  3    Timid               3.45            -              [illegible]
  4    Rash                3.60            -              -33
  5    Self-Controlled     6.98            +              41
  6    Uninhibited         5.88            +              [illegible]
  7    Inhibited           3.67            -              [illegible]
  8    Impulsive           5.30            +              -21
  9    Serious             5.92            +              01
 10    Gay                 6.57            +              23
 11    Grim                3.01            -              -45
 12    Frivolous           4.10            -              -21
 13    Alert               7.14            +              34
 14    Relaxed             6.60            +              44
 15    Tense               3.44            -              -41
 16    Lethargic           3.04            -              -37
 17    Committed           5.98            +              19
 18    Open-Minded         8.12            +              22
 19    Fanatical           2.96            -              -36
 20    Noncommittal        3.88            -              -34
 21    Steady              6.55            +              40
 22    Flexible            6.90            +              42
 23    Inflexible          2.90            -              -51
 24    Vacillating         4.13            -              [illegible]
 25    Modest              5.56            +              25
 26    Confident           7.08            +              34
 27    Self-Disparaging    3.70            -              [illegible]
 28    Conceited           2.92            -              -35
 29    Thrifty             5.79            +              09
 30    Generous            [illegible]     [illegible]    18
 31    Stingy              2.84            -              [illegible]
 32    Extravagant         4.44            -              -25
 33    Skeptical           5.07            +              -30
 34    Trusting            7.00            +              38
 35    Distrustful         2.72            -              -49
 36    Gullible            3.62            -              [illegible]
 37    Selective           5.94            +              10
 38    Tolerant            7.56            +              48
 39    Choosy              4.34            -              -11
 40    Undiscriminating    4.20            -              [illegible]
 41    Firm                6.55            +              [illegible]
 42    Lenient             5.77            +              25
 43    Severe              3.05            -              -35
 44    Lax                 3.92            -              -30
 45    Discreet            6.93            +              28
 46    Frank               6.07            +              15
 47    Secretive           3.96            -              [illegible]
 48    Indiscreet          3.42            -              -30
 49    Individualistic     7.62            +              [illegible]
 50    Cooperative         6.81            +              46
 51    Uncooperative       2.71            -              -50
 52    Conforming          4.48            -              06
 53    Pragmatic           5.01            +              -09
 54    Idealistic          6.56            +              -04
 55    Opportunistic       4.59            -              05
 56    Unrealistic         3.20            -              -22
 57    Cultivated          6.46            +              16
 58    Natural             6.81            +              37
 59    Artificial          2.54            -              -59
 60    Naive               3.78            -              -16
 61    Thorough            6.48            +              27
 62    Careless            3.40            -              -53
 63    Fussy               3.70            -              -12
 64    Moral               6.44            +              36
 65    Immoral             2.89            -              -39
 66    Self-Righteous      3.08            -              -12
 67    Curious             7.33            +              25
 68    Uninquisitive       2.86            -              -18
 69    Nosy                2.94            -              -28
 70    Forceful            5.52            +              -02
 71    Submissive          4.08            -              -17
 72    Domineering         3.16            -              -18
 73    Peaceful            6.89            +              33
 74    Belligerent         2.69            -              -37
 75    Passive             3.69            -              -18
 76    Polite              6.73            +              16
 77    Rude                2.69            -              -39
 78    Ingratiating        4.10            -              -22
 79    Intelligent         7.60            +              20
 80    Stupid              2.24            -              -45
 81    Crafty              4.78            -              [illegible]
 82    Foresighted         6.77            +              [illegible]
 83    Short-Sighted       3.20            -              [illegible]
 84    Scheming            3.32            -              [illegible]
 85    Meditative          6.07            +              [illegible]
 86    Unmeditative        [illegible]     [illegible]    [illegible]
 87    Brooding            [illegible]     [illegible]    [illegible]
 88    Witty               6.70            +              [illegible]
 89    Humorless           3.02            -              [illegible]
 90    Sarcastic           [illegible]     [illegible]    [illegible]

For each trait, the percentage of the combined group of subjects
responding True to the trait was obtained. These percentages are
based only on the responses of those subjects who indicated that
they knew the meaning of the trait by answering it either True or
False. Each of these percentages may be regarded as an estimate
of the probability that a given trait will be answered True in
self-description by those subjects who believe they understand the
meaning of the trait.

The keyed response to each trait was taken as True and assigned 
a score of 1 and a False response was assigned a score of 0. Inter- 
correlations were then obtained between each of the 90 traits, the 
correlations in all cases being based only on the responses of those 
subjects who had indicated they knew the meaning of the traits 
by marking them either True or False. 

The 90 x 90 correlation matrix, with ones in the diagonal, was
then factor analyzed by the method of principal components. Table
4 gives the unrotated factor loadings of the traits on the first
principal component.²
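The scoring and factoring steps just described can be sketched as follows. This is a minimal illustration on simulated responses, not the study's data or program: the array `X`, the helper names, and the use of NaN for a trait a subject did not answer True or False are all assumptions made for the example.

```python
import numpy as np

def pairwise_corr(X):
    """Product-moment correlations, each based only on the subjects who
    answered both traits (1 = True, 0 = False, NaN = not answered)."""
    n = X.shape[1]
    R = np.eye(n)  # ones in the diagonal
    for j in range(n):
        for k in range(j + 1, n):
            ok = ~np.isnan(X[:, j]) & ~np.isnan(X[:, k])
            R[j, k] = R[k, j] = np.corrcoef(X[ok, j], X[ok, k])[0, 1]
    return R

def first_component_loadings(R):
    """Unrotated loadings on the first principal component and the
    proportion of total variance it accounts for."""
    roots, vectors = np.linalg.eigh(R)      # roots in ascending order
    loadings = vectors[:, -1] * np.sqrt(roots[-1])
    return loadings, roots[-1] / R.shape[0]

# Simulated subjects-by-traits responses (hypothetical data).
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 1))
X = (latent + rng.normal(size=(500, 6)) > 0).astype(float)
X[rng.random(X.shape) < 0.05] = np.nan      # some traits left unanswered

R = pairwise_corr(X)
loadings, share = first_component_loadings(R)
```

The sign of an eigenvector is arbitrary, so the whole column of loadings may need to be reflected before it is compared with an external variable such as the SDSVs.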


Results and Discussion 


Figure 1 shows the relationship between P(T) for each of the
90 traits and the SDSVs of the traits. It is obvious that P(T) tends
to increase with the SDSVs of the traits. The product-moment
correlation between P(T) and SDSV is .92 and is fairly typical of
the correlations previously reported between these two variables
(Edwards, 1967).

Figure 1. Relationship between P(T) and SDSV for 90 traits.

² The first principal component accounted for 9.28 per cent of the
total variance.

In Figure 2 the unrotated loadings of the traits on the first 
principal component are plotted against the SDSVs of the traits. 
Those traits with low or socially undesirable scale values tend to 
have relatively high negative loadings, whereas those traits with 
high or socially desirable scale values tend to have relatively high 
positive loadings. The product-moment correlation between the 
SDSVs and the unrotated first factor loadings of the traits is .90. 
Thus, the signed first factor loadings of these single item scales are 
linearly related to the intensity or SDSVs of the scales, a finding 
that is consistent with the results obtained by Edwards and Walsh. 

Table 4 gives the evaluative sign of each trait, the sign being re-
garded as positive if the trait has an SDSV > 5.0 or as negative if
the trait has an SDSV < 5.0. It may be noted that the signed load-
ings of the traits on the first principal component tend to alternate
systematically in accordance with the evaluative signs of the traits.
This finding is in accord with an interpretation of the first prin-
cipal component as an SD-SUD factor.

Figure 2. Relationship between factor loadings on the first principal
component and SDSVs for 90 traits.
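The sign rule can be checked on a handful of rows from Table 4. The sketch below is only illustrative: it covers 13 of the 90 traits, and it assumes the loadings are printed without the decimal point (so 46 is read as .46).

```python
# (trait, SDSV, loading) rows copied from Table 4; reading the loadings
# as two-place decimals is an assumption of this sketch.
rows = [
    ("Cooperative", 6.81, 0.46), ("Uncooperative", 2.71, -0.50),
    ("Pragmatic", 5.01, -0.09), ("Idealistic", 6.56, -0.04),
    ("Natural", 6.81, 0.37), ("Artificial", 2.54, -0.59),
    ("Thorough", 6.48, 0.27), ("Careless", 3.40, -0.53),
    ("Moral", 6.44, 0.36), ("Immoral", 2.89, -0.39),
    ("Forceful", 5.52, -0.02), ("Polite", 6.73, 0.16),
    ("Rude", 2.69, -0.39),
]

def evaluative_sign(sdsv, neutral=5.0):
    """+1 if the SDSV is above the neutral point, otherwise -1."""
    return 1 if sdsv > neutral else -1

agree = [t for t, sdsv, load in rows if evaluative_sign(sdsv) * load > 0]
disagree = [t for t, sdsv, load in rows if evaluative_sign(sdsv) * load < 0]
```

In this subset the traits whose loading sign departs from the evaluative sign all have loadings near zero.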

The frequency distribution of the 40 Type I correlation coefficients 
is given in column (1) of Table 5. As pointed out previously and as 
shown in Table 2, trait consistencies and SD consistencies are com- 
pletely confounded in Type I correlation coefficients and both trait 
and SD consistencies should result in negative values for Type I 
coefficients. The average value of the Type I coefficients is —.24 
and only four of the 40 coefficients are positive. 


TABLE 5 
Frequency Distributions of Correlation Coefficients for Pairs of Traits 


r (1) (2) (3) (4) (5) (6)
  f f f f f f

+35 1 2 

+30 1 2 4 

+25 1 11 

+20 2 5 15 

+15 2 6 15 

+10 6 10 9 13 

:05 6 7 4 5 3 

.00 4 10 4 5 1 5 
—.00 2 5 5 2 7 
—.05 4 4 5 7 
-:10 E 5 3 1 13 
—.15 6 3 1 5 
—.20 4 5 
— 25 3 1 1 
—.30 3 2 
—.35 2 1 
—.40 2 
—.45 2 
—.50 1 
—.55 3 
—.60 1 

Mean -.24 -.01 .06 .12 .19 -.10
8 48 .09 Bi Al .08 15 


(1) Pairs of traits opposite in descriptive similarity and opposite in evaluative sign.
(2) Pairs of traits opposite in descriptive similarity but with the same evaluative sign.
(3) Pairs of traits with descriptive similarity but opposite in evaluative sign.
(4) Pairs of traits from different sets and presumably different traits but with SDSVs equal
to or greater than 7.0.
(5) Pairs of traits from different sets and presumably different traits but with SDSVs less
than 3.0.
(6) Pairs of traits from different sets and presumably different traits but with opposite
evaluative signs.

With respect to Type II coefficients, trait consistencies should
result in negative and SD tendencies in positive correlation coeffi- 
cients. The frequency distribution of the 40 Type II coefficients is 
given in column (2) of Table 5. It may be noted that 22 of the coeffi- 
cients are positive and 18 are negative and that the average value is 
—.01. This finding offers no convincing evidence that trait con- 
sistencies, in general, dominate SD consistencies or vice versa. 

For Type III coefficients, SD consistencies should result in nega- 
tive values and trait consistencies in positive values. The frequency 
distribution of the 40 Type III coefficients is given in column (3). 
The average value of the Type III coefficients is only .06, with 13 of 
the 40 coefficients being negative. A reasonable hypothesis as to why 
the average value of the Type III coefficients is only .06 is that a 
considerable number of the subjects gave consistent SD or SUD re- 
sponses to the paired terms. For these subjects the response patterns 
would be either TF or FT. If these response patterns occur with any 
degree of frequency, the trait consistent response patterns, TT and 
FF, will be reduced accordingly. Thus, consistent SD or SUD re- 
sponses would tend to lower the correlation between the paired 
terms. Note, for example, that when trait consistent and SD con- 
sistent tendencies operate in the same direction, as in the case of 
Type I coefficients, the average correlation between the paired 
terms is —.24. 
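The effect of mixed response patterns on the correlation between two True-False items can be illustrated with the phi coefficient computed from the four pattern counts. The counts below are hypothetical and chosen only to show the direction of the effect for a Type III pair.

```python
import math

def phi(tt, tf, ft, ff):
    """Product-moment (phi) correlation between two True-False items,
    given the counts of the four response patterns."""
    n = tt + tf + ft + ff
    p1 = (tt + tf) / n           # proportion answering True to item 1
    p2 = (tt + ft) / n           # proportion answering True to item 2
    p11 = tt / n                 # proportion answering True to both
    return (p11 - p1 * p2) / math.sqrt(p1 * (1 - p1) * p2 * (1 - p2))

# Every subject responds trait-consistently (TT or FF).
trait_only = phi(tt=50, tf=0, ft=0, ff=50)

# Forty per cent of subjects give the SD-consistent patterns TF or FT.
mixed = phi(tt=30, tf=20, ft=20, ff=30)
```

With these counts the correlation falls from 1.0 to about .20, which is the mechanism the paragraph above proposes for the low average Type III coefficient.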

There are nine trait terms with SDSVs equal to or greater than 
7.0 and such that each of the terms is from a different set. Thus, we 
have: 


Trait No. Trait Set SDSV
13 Alert 4 7.14
18 Open-Minded 5 8.12
26 Confident 7 7.08
30 Generous 8 7.18
34 Trusting 9 7.00
38 Tolerant 10 7.56
49 Individualistic 13 7.62
67 Curious 18 7.33
79 Intelligent 22 7.60

The fact that these traits were classified in different sets by Pea- 
body and that the terms within a given set are presumably related 
to a common attribute would seem to imply that these nine traits
should be less highly related than the trait terms within a given set. 
Calculating the correlation coefficient between each pair of the nine 
traits, we obtain 36 correlation coefficients. If SD consistencies are 
operating then, in general, the average value of the 36 coefficients 
should be positive. The distribution of the 36 correlation coefficients 
is shown in column (4) of Table 5 and the average value is .12. This 
average is higher than the average value, .06, between pairs of traits 
with descriptive similarity but with opposed evaluative signs. 

A dynamically oriented trait psychologist could undoubtedly offer 
a plausible explanation of why the nine traits tend to be positively 
intercorrelated. If the nine traits are regarded as a single scale in
which the True response to each trait is the keyed response, then
high scorers on this scale might be described as “well-adjusted,” 
as having a high degree of “ego-resiliency,” and as being relatively 
free from “anxiety,” and the like. All of these dynamic descriptions, 
however, are themselves socially desirable descriptions and, in es- 
sence, the high scorer on this scale would be one who tends to give 
SD responses in self-description. 

If we select terms with SDSVs less than 3.0 and such that each 
term is from a different set, we have the following: 


Trait No. Trait Set SDSV 
19 Fanatical 5 2.96 
23 Inflexible 6 2.90 
28 Conceited 7 2.92 
31 Stingy 8 2.84 
35 Distrustful 9 2.72 
51 Uncooperative 13 2.71 
59 Artificial 15 2.54 
65 Immoral 17 2.89 
69 Nosy 18 2.94 
74 Belligerent 20 2.69 
77 Rude 21 2.69 
80 Stupid 22 2.24 


Again, because these trait terms are from different sets, it might be
assumed that they have less in common than the terms within a
given set. Regardless of the differences in the traits, if SD con-
sistencies are primary, then the average correlation between these
pairs of terms should be positive. The distribution of the 66 cor-
relation coefficients is shown in column (5) of Table 5 and the
average value is .19. Again we note that this value is higher than the
average value, .06, between traits with descriptive similarity but
with opposite evaluative signs.

If we eliminate “generous” and “trusting” with positive evaluative
signs and “uncooperative,” “fanatical,” “stupid,” “nosy,” and “con-
ceited” with negative signs, we have a group of seven traits with
positive signs and another group of seven traits with negative signs 
and such that no trait with a positive sign is from the same set as a 
trait with a negative sign. If SD consistencies are operating, then 
the average correlation between terms in these two groups should 
be negative. The distribution of the 49 correlation coefficients is 
shown in column (6) of Table 5. The average value is —.10. 
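The pair counts behind columns (4), (5), and (6) of Table 5 (36, 66, and 49) and the corresponding averages can be reproduced with a short sketch. The 14 x 14 matrix `R` below is hypothetical; its entries are set to the reported averages purely so that the functions have something to summarize.

```python
import numpy as np
from itertools import combinations, product

def mean_within(R, idx):
    """Average correlation over all distinct pairs inside one group."""
    pairs = list(combinations(idx, 2))
    return np.mean([R[j, k] for j, k in pairs]), len(pairs)

def mean_between(R, idx_a, idx_b):
    """Average correlation over all pairs with one trait from each group."""
    pairs = list(product(idx_a, idx_b))
    return np.mean([R[j, k] for j, k in pairs]), len(pairs)

# Hypothetical matrix: 7 desirable traits followed by 7 undesirable traits.
pos, neg = list(range(7)), list(range(7, 14))
R = np.full((14, 14), -0.10)
R[np.ix_(pos, pos)] = 0.12
R[np.ix_(neg, neg)] = 0.19
np.fill_diagonal(R, 1.0)

within_pos, n_pos = mean_within(R, pos)
between, n_cross = mean_between(R, pos, neg)

pairs_among_9 = len(list(combinations(range(9), 2)))    # column (4)
pairs_among_12 = len(list(combinations(range(12), 2)))  # column (5)
```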

The 90 traits investigated in this study are the same as those used
in an earlier study by Peabody (1967). In the present study the
objective was to investigate trait and SD consistencies when sub-
jects are asked to describe themselves by answering True or False
to items or trait terms. In the Peabody study, however, the subjects
were told that another person possessed a given trait and they were
then asked to judge how likely it was that he possessed one of two
other traits. For example, the subjects were told that another per-
son was accurately described by “cautious” and were then asked to
judge whether the same person was more likely to be “timid” or
“bold.” For 70 critical judgments in which a given trait was to be
judged on a bipolar scale of two other terms from the same set so
that descriptive and evaluative similarity opposed each other, as
when “cautious” is to be judged on the scale “timid-bold,” Peabody
found that the mean ratings were all in the direction of descriptive
similarity and away from evaluative similarity. He regards this
finding as evidence that when subjects are confronted directly with
a choice between descriptive similarity and evaluative similarity,
descriptive similarity is decisive over evaluative similarity.

It may be noted, however, that Peabody’s method of data col-
lection is essentially a forced-choice technique. For example, the
subjects are told that another person is “cautious” and now must
choose between “timid” and “bold.” It seems not at all unreasonable,
if a choice must be made between “timid” and “bold,” that subjects
would in general favor “timid” over “bold,” simply because “timid”
is closer in meaning to “cautious” than “bold” is. On the other
hand, the True-False format used in the present study permits a
subject to answer True to “cautious” without the necessity of also
answering True to “timid.”

Rather than concluding that descriptive similarity is decisive 
over evaluative similarity, one might interpret Peabody’s results 
as indicating that if subjects are given one word or term and then 
forced to choose between two other terms, one of which is similar 
in meaning and one of which is opposite in meaning, they will in 
general favor the word which is similar in meaning. It does not 
follow that simply because subjects believe “stingy” to be more 
similar in meaning to “thrifty” than “generous” that these two 
traits are necessarily related in the real world of people. That they 
are not for the subjects involved in the present study is shown by 
the fact that the correlation between “thrifty-stingy” is —.02. 
FACTOR ANALYSES WITH VARIABLES
OF DIFFERENT METRIC 


JOHN L. HORN 
University of Denver 


Behavioral scientists will generally agree that for most behav-
ioral variables the origin and the unit of the measurement scale are
arbitrary. It is on this basis, largely, that investigators transform
obtained scores to standard scores, as in calculating product mo-
ment correlation coefficients, before applying the algorithms of
factor analysis. Certainly if the origins and units of measurement
are arbitrary it is desirable to have them such that the variables
are as easy to work with as is possible and in the algorithms of
factor analysis standard scores are easier to work with than are
raw scores or deviation scores.

Rather in contrast with investigators who work extensively with 
factor analytic methods—on problems that are at least tangentially 
in the area of scaling—those who work primarily on what might 
be called “pure scaling” problems, as exemplified by the pro- 
cedures which Torgerson (1958) describes, do not transform scores 
to standard score form before carrying out procedures analogous to 
those of factor analysis. This, of course, follows from the fact that 
the problems in the pure scaling area are often conceived of as 
those of finding an origin and unit of measurement. Investigators, 
such as Tucker and Messick (cf. Tucker, 1958; Tucker and Mes-
sick, 1963; Messick, 1961; Messick and Abelson, 1956), who have
worked extensively in both the factor analytic and pure scaling
areas, have brought to our attention the fact that the assumptions
underlying a decision to standardize raw scores, as well as the as-
sumptions underlying a decision not to do this, frequently are not
made explicit and may not be particularly sound.

Although the unit and origin of psychological variables
must be regarded as arbitrary, the mean and standard deviation
are not arbitrary but indicate the level and scatter of performance
in a particular sample of observations. By removing these in
standardization of variables one thereby throws away potential
sources of influence on a factor analytic solution. It is true that often
it is desirable to eliminate these sources of influence, but it need not
always be so. For example, in the study of change in repeated ob-
servations on the same entities (cf. Horn, 1966; Horn and Little,
1966), the level and scatter on different occasions represent in-
fluences which probably should be recorded in the patterning which
the investigator seeks to identify by use of factor analytic proced-
ures. Also, as we acquire a good understanding of a domain of
variables and use these variables in replicative studies on rather
different samples of people, it may become interesting to study the
differences in factorial structure that are associated with sample
differences in means and variances. Similarly, as will be elaborated
below, when working with dichotomous (present versus not present)
variables, the patterning which can be found among the interre-
lationships of raw-score forms of these variables will, in some
cases, be of interest.

In any case it seems that more and more investigators are again 
asking the question “Should I or should I not standardize my scores
before doing a factor analysis or something comparable to this?”
The present paper represents an attempt to formulate some of the 
issues implied by this question, not in a definitive mathematically 
elegant way, but in a way which will relate the problem to ques- 
tions of a substantive nature. 


Factorial Solutions for Raw, Deviation and Standard Scores

Assume that the observed scale value $p_{ji}$ for entity $i$ (e.g.,
person $i$) ($i = 1, \ldots, N$) on variable $j$ ($j = 1, \ldots, n$) is to be
understood in terms of a factor analysis model in which observed
scores are approximately reproduced by a weighted summation of
$m$ common-factor scores:

$$p_{ji} \cong a_{j1}f_{1i} + a_{j2}f_{2i} + \cdots + a_{jm}f_{mi} \qquad (1)$$

To represent this assumption in matrix form, we may cast the
values into an n by N matrix of $p_{ji}$ scores, P, an n by m matrix of
$a_{jq}$ factor coefficients, A, and an m by N matrix of $f_{qi}$ factor
scores, F, whence

$$P \cong AF \qquad (2)$$

To obtain estimates of the A and F values it is customary to
begin with the Gram product

$$G = \frac{1}{N} P P^{T} \qquad (3)$$

To this point nothing has been assumed about the scale values of 
P except that they are such that one would feel justified in com- 
puting the sums of squares and cross-products as indicated in (3). 
The scale values might be in obtained form, in which case (3)
represents raw score cross products and sums of squares. It is
possible that such obtained scale values will be meaningful in raw
score form even though they are arbitrary. For example, dichoto-
mous zero or unity scale values assigned to represent “Nay” and
“Yea” votes of U. S. Senators on bills presented to the Congress
might be rather meaningful because merely by counting the number
of unity scale values for a particular bill one obtains a clear and 
direct indication of whether or not the bill passed. In this case, too, 
the cross-product divided by N [as in equation (3)] for bills A and 
B has a rather clear and direct interpretation—it is the proportion 
of Senators who agree in voting “Yea” on the two issues. Thus in 
this kind of a situation one might seriously consider doing factor 
analysis, or something similar to this, on the obtained scores. The 
question then becomes: “What kind of variation would be repre- 
sented in the resulting factors?" 
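The interpretation of the raw-score Gram product for dichotomous votes can be seen in a small sketch. The vote matrix `P` below is invented for illustration (three bills, six Senators); rows are variables and columns are entities, following the n by N layout of equation (2).

```python
import numpy as np

# Hypothetical votes: rows = bills, columns = Senators; 1 = Yea, 0 = Nay.
P = np.array([
    [1, 1, 1, 0, 0, 1],   # bill A
    [1, 0, 1, 0, 1, 1],   # bill B
    [0, 0, 1, 1, 0, 0],   # bill C
], dtype=float)

N = P.shape[1]
G = P @ P.T / N           # Gram product of equation (3)

yea_share = np.diag(G)    # proportion voting Yea on each bill
both_yea_AB = G[0, 1]     # proportion voting Yea on both A and B
```

A diagonal entry above .5 indicates a simple majority of Yea votes among the Senators in the matrix, and each off-diagonal entry is the proportion of joint Yea votes.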

As is well known a least squares approximation of P may be
found such that

$$G = E L^{2} E^{T} \qquad (4)$$

where $L^{2}$ is an m by m diagonal matrix containing the m nonzero
latent roots of G and E contains the associated scaled latent vectors
(i.e., $E^{T}E = I$). Also,

$$A = EL \qquad (5)$$

and since $LL^{T} = L^{T}L = L^{2}$,

$$G = (EL)(EL)^{T} = AA^{T} \qquad (6)$$

and

$$P \cong ELF \qquad (7)$$

(where it is assumed that $FF^{T} = I$, i.e., factors are uncorrelated
and in standard score form).
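Equations (4) through (6) can be checked numerically. The sketch below uses simulated scores and NumPy's symmetric eigensolver; the data, the seed, and the choice of m = 2 retained factors are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.normal(size=(5, 200))            # n = 5 variables, N = 200 entities
G = P @ P.T / P.shape[1]                 # Gram product, equation (3)

roots, vectors = np.linalg.eigh(G)       # ascending order
order = np.argsort(roots)[::-1]          # re-sort, largest root first
roots, vectors = roots[order], vectors[:, order]

# Full solution: all latent roots retained, so A A^T reproduces G exactly.
A_full = vectors @ np.diag(np.sqrt(roots))

# Truncated solution with m factors: a least squares approximation.
m = 2
E = vectors[:, :m]                       # E^T E = I
L = np.diag(np.sqrt(roots[:m]))          # L^2 holds the m largest roots
A = E @ L                                # equation (5)
G_hat = A @ A.T                          # equation (6)

residual = np.linalg.norm(G - G_hat)     # Frobenius norm of the misfit
```

The residual equals the square root of the sum of squared discarded roots, which is what makes the truncated solution least squares.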
Division by N in equation (3) makes it possible to express the
elements of G in terms of descriptive statistics which are well known
in psychology. Thus the diagonal elements are of the form

$$\frac{\sum_{i=1}^{N} p_{ji}^{2}}{N} = S_j^{2} + M_j^{2} \qquad (8)$$

where $S_j^{2}$ represents the variance of variable j and $M_j$ represents
the mean. In the off-diagonal sections of G the values are

$$\frac{\sum_{i=1}^{N} p_{ji}p_{ki}}{N} = r_{jk}S_jS_k + M_jM_k \qquad (9)$$

This indicates that in factor analyzing G for scale values that have
not been transformed to standard scores, one is analyzing coeffi-
cients involving level and scatter influences (i.e., as indexed in
$M_j$ and $S_j$, respectively) as well as the shape influences represented
by $r_{jk}$ (see Cronbach and Gleser, 1953).
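The decomposition of the raw cross-products into level, scatter, and shape can be verified on any data set. The following sketch uses simulated scores with arbitrary means and standard deviations; variances are computed with divisor N so that they match the division by N in equation (3).

```python
import numpy as np

rng = np.random.default_rng(4)
scale = np.array([[1.0], [2.0], [0.5], [4.0]])
shift = np.array([[3.0], [-1.0], [10.0], [0.0]])
P = shift + scale * rng.normal(size=(4, 300))   # variables by entities

N = P.shape[1]
G = P @ P.T / N                  # raw-score cross-products

M = P.mean(axis=1)               # level
S = P.std(axis=1)                # scatter (divisor N)
R = np.corrcoef(P)               # shape

# R has ones in the diagonal, so this one expression covers both the
# diagonal form S_j^2 + M_j^2 and the off-diagonal form.
G_check = R * np.outer(S, S) + np.outer(M, M)
```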

Calculating the first centroid factor coefficient¹ for variable j by
the usual procedure yields

$$a_{j1} = \frac{S_j^{2} + M_j^{2} + S_j\sum_{k \ne j} r_{jk}S_k + M_j\sum_{k \ne j} M_k}{\sqrt{\sum_{j}(S_j^{2} + M_j^{2}) + 2\sum_{j<k} r_{jk}S_jS_k + 2\sum_{j<k} M_jM_k}} \qquad (10)$$

It can be seen that if the means all happened to be zero (e.g., if the
scale values were in deviation score form), this would be

$$a_{j1} = \frac{S_j^{2} + S_j\sum_{k \ne j} r_{jk}S_k}{\sqrt{\sum_{j} S_j^{2} + 2\sum_{j<k} r_{jk}S_jS_k}} \qquad (11)$$

If, additionally, all variances happened to be unity (e.g., if scale
values were in standard score form), the usual centroid factor
coefficient would be obtained

$$a_{j1} = \frac{1 + \sum_{k \ne j} r_{jk}}{\sqrt{n + 2\sum_{j<k} r_{jk}}} \qquad (12)$$

¹ For present purposes it is convenient to express the idea of factoring G
in terms of methods for which the algebra is relatively straightforward, i.e.,
factoring by the centroid procedure with unaltered main diagonal entries. The
results obtained under this kind of simplification differ in detail from the
results suggested by examination of more complex models, but the general
findings of concern here are as well illustrated by the simple as by the complex
procedures.

It is evident that the factor coefficients determined on raw score 
cross-products matrices will be a function of the means and vari- 
ances of variables, as well as of the correlations. The factor
coefficients determined on covariance matrices, as in (11), will be a
function of scatter (variance) and shape (correlation). In the
usual factor analysis the focus is on shape alone. Thus, level, scatter
and shape would be confounded in analyses of cross-product ma-
trices for raw scores and scatter and shape would be confounded in
analyses of covariance matrices (for deviation scores).
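A small numerical comparison shows how the first centroid coefficients change across the three kinds of matrix. The sketch assumes the simplified centroid procedure with unaltered diagonal entries (row sums divided by the square root of the grand sum) and positive row sums, so that no reflection of variables is needed; the data are simulated.

```python
import numpy as np

def first_centroid(mat):
    """First centroid coefficients: row sums over sqrt of the grand sum."""
    return mat.sum(axis=1) / np.sqrt(mat.sum())

rng = np.random.default_rng(2)
common = rng.normal(size=300)
means = np.array([10.0, 0.5, 3.0, 50.0])
sds = np.array([2.0, 0.3, 1.0, 8.0])
Z = 0.7 * common + 0.7 * rng.normal(size=(4, 300))   # positively correlated
P = means[:, None] + sds[:, None] * Z                # variables by entities

W = P @ P.T / P.shape[1]     # raw-score cross-products: level, scatter, shape
C = np.cov(P, bias=True)     # covariances: scatter and shape
R = np.corrcoef(P)           # correlations: shape alone

a_raw, a_cov, a_cor = (first_centroid(mat) for mat in (W, C, R))
```

In this example the variable with the largest mean dominates `a_raw`, while `a_cor` depends on the correlations only.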


Partitioning the Effects of Level, Scatter and Shape
on a Factorial Solution

These developments suggest that it might be useful to obtain
factorial solutions for derived matrices in which confounding ele-
ments would be eliminated or reduced. One approach to this kind
of solution is to algebraically remove the terms which seemingly
produce the confounding. Thus, for example, given that one has
calculated the raw scores cross-products matrix, W, and the co-
variance matrix, C, the matric difference

$$W - C = L \qquad (13)$$

might be calculated to effect removal of scatter and shape, then
providing a matrix representing level influences alone. In terms of
scalar algebra this operation amounts to calculating

$$l_{jk} = r_{jk}S_jS_k + M_jM_k - r_{jk}S_jS_k = M_jM_k \qquad (14)$$

That is, L is a matrix containing $M_j^{2}$ in the main diagonal and
$M_jM_k$ in the off-diagonal sections. Thus L could be computed
directly as the major product moment

$$L = \begin{bmatrix} M_1 \\ M_2 \\ \vdots \\ M_n \end{bmatrix} \begin{bmatrix} M_1 & M_2 & \cdots & M_n \end{bmatrix} \qquad (15)$$

It can be seen, also, that the first centroid factor coefficient for
variable j determined on L would be

$$a_{j1} = \frac{M_j \sum_{k} M_k}{\sqrt{\sum_{j}\sum_{k} M_jM_k}} \qquad (16)$$

and that the first factor vector of such coefficients would perfectly
reproduce L. Thus, the factor coefficients pertaining to level alone
are simply those of the vector of means. If the origins and units of
measurement scales are arbitrary, this is a vector of arbitrary
values.
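That the level matrix is of rank one and is reproduced by the vector of means can be confirmed directly. The sketch uses simulated scores whose means are all positive, which is assumed here so that the sum of means under the square root in (16) is positive.

```python
import numpy as np

rng = np.random.default_rng(5)
P = rng.random(size=(4, 100)) + np.array([[0.0], [1.0], [2.0], [5.0]])

W = P @ P.T / P.shape[1]          # raw-score cross-products
C = np.cov(P, bias=True)          # covariances (divisor N)
L_level = W - C                   # level matrix of equation (13)

means = P.mean(axis=1)

# First centroid coefficients determined on the level matrix.
a1 = L_level.sum(axis=1) / np.sqrt(L_level.sum())
```

Here `a1` coincides with the vector of means, and its outer product with itself returns the level matrix.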

If a factoring of raw score variables were carried out, complete 
with simple structure rotation, it is possible that one of the factors 
would mirror the vector of means which is solved for in L. More 
likely, perhaps, the variance represented by this vector would be 
distributed through several factors. To ascertain the extent to which 
level was affecting the simple structure factors, these factors would 
be compared for similarity with the mean vector, as with the 
coefficient of colinearity described in equations (22) and (23) below.
More practically, the implication is that in interpreting factors
one would keep the vector of means in the corner of his eye, so
to speak, and where indicated by its similarity to a factor, allow
level to enter into the factor interpretation.

Since the covariance matrix can be written in terms of the cor-
relation matrix

$$C = SRS \qquad (17)$$

(where S is a diagonal matrix in which the non-zero elements are
the standard deviations of variables) and R can be regarded as
factored in accordance with equation (6) such that it is well-repro-
duced by $AA^{T}$, then

$$B = SA \qquad (18)$$

and the covariance matrix can be regarded as well-reproduced by
$BB^{T}$. This implies that a factor coefficient for C is the factor
coefficient for R weighted by the standard deviation for the variable
in question.

$$b_{jq} = S_j a_{jq} \qquad (19)$$

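The rescaling from a correlation-matrix solution to a covariance-matrix solution can be sketched numerically. To keep the reproduction exact, the example retains all n principal factors of R, which is an assumption of the sketch rather than a recommendation; the data are simulated.

```python
import numpy as np

rng = np.random.default_rng(3)
P = rng.normal(size=(4, 250)) * np.array([[1.0], [2.0], [0.5], [3.0]])
P[1] += 0.8 * P[0]                        # induce some correlation

R = np.corrcoef(P)
C = np.cov(P, bias=True)
S = np.diag(np.sqrt(np.diag(C)))          # diagonal matrix of standard deviations

roots, E = np.linalg.eigh(R)
A = E @ np.diag(np.sqrt(np.clip(roots, 0.0, None)))   # A A^T = R
B = S @ A                                             # weighted coefficients
```

Each row of `B` is the corresponding row of `A` multiplied by that variable's standard deviation, and `B @ B.T` returns the covariance matrix.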
This is backward reasoning, however; it is backward in the sense
that it assumes that the rules of thumb adopted for deciding on rank
would lead to the same conclusion in the case of R as in the case of
C and it is backward in the sense that it assumes that in writing
the factor coefficient matrix one is concerned only to reproduce the
Gram product matrix. The first of these assumptions probably will
not bother the investigator who is primarily interested in sub-
stantive implications. The second assumption is more difficult for
this investigator to accept, however. In rotation of factors from C,
the $S_j$ weights as well as the $a_{jq}$ coefficients could determine the
simple structure, that is, the groupings of variables upon which interpre-
tations of factors would be based. In other words, although a factorial
solution for the correlation matrix can be transformed into a par-
ticular factorial solution for the covariance matrix (and vice versa),
this latter would not, in general, be the factorial solution actually
obtained by application of the usual factor analytic algorithms
(including simple structure rotation) to the covariance matrix.

If A is an orthogonal simple structure factor coefficient matrix
based upon R, K is a similar solution based upon C and B has the
meaning given in equation (18), then there exists an m by m orthog-
onal transformation, T, which carries B into K and thus carries
the scaled A into K; that is,

$$K = BT = SAT = SH \qquad (20)$$

(where, clearly, H = AT, a transformation on the simple structure
of A). This suggests that to determine the influence of scatter on a
factorial solution one would perhaps first obtain a simple structure
on the factor coefficient matrix derived from R, transform this by
the standard deviation weights of S, check the result for its simple
structure characteristics and if these were inadequate, rotate it to a
simple structure. The effect of scatter on a factorial solution would
then be indicated by a comparison of A and H.

To help describe the similarity of A and H one might use the
coefficient which Burt (1948) suggested for factor comparisons:

$$D = \frac{H^{T}A}{\sqrt{H^{T}H}\,\sqrt{A^{T}A}} \qquad (21)$$

Recall, too, that in terms of unitary vectors and roots,

$$A^{T}A = (EL)^{T}(EL) = L^{T}E^{T}EL \qquad (22)$$

This kind of coefficient has been used rather extensively by Tryon
(cf. Tryon and Bailey, 1968) particularly, but also by Tucker
(1951), Wrigley and Neuhaus (1955), Cattell (1966) and others.
In terms of scalar algebra, the coefficient is

$$d_{qq'} = \frac{\sum_{j} h_{jq}a_{jq'}}{\sqrt{\sum_{j} h_{jq}^{2}\,\sum_{j} a_{jq'}^{2}}} \qquad (23)$$

As Tryon pointed out, this kind of coefficient will be large if variables
have similar relationships to factors in the domains represented by
H and A. More mundanely, if factors in the two solutions are ar-
ranged in an order such that the largest coefficient in any row will be
the diagonal element, $d_{qq}$, then the similarity of the two solutions
is indicated by the extent to which D approaches diagonal form.
Conversely, if D is not approximately in diagonal form, then the
varying standard deviations for variables affect the factorial clus-
tering of variables. Of course if variances, and hence standard
deviations, are basically arbitrary, the difference between H and
A will not be of substantive interest.
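The scalar form of the comparison coefficient lends itself to a direct computation. In the sketch below the loading matrix `A`, the 10-degree rotation `T`, and the function name `congruence` are all hypothetical; `H = A @ T` stands in for a transformed solution as in equation (20).

```python
import numpy as np

def congruence(H, A):
    """Matrix of coefficients d[q, q'] = sum_j h_jq a_jq' divided by
    sqrt(sum_j h_jq^2 * sum_j a_jq'^2)."""
    num = H.T @ A
    den = np.sqrt(np.outer((H ** 2).sum(axis=0), (A ** 2).sum(axis=0)))
    return num / den

# Hypothetical simple structure for six variables on two factors.
A = np.array([[0.8, 0.1], [0.7, 0.0], [0.6, 0.2],
              [0.1, 0.7], [0.0, 0.8], [0.2, 0.6]])

theta = np.deg2rad(10.0)
T = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # orthogonal transformation
H = A @ T

D = congruence(H, A)
D_self = congruence(A, A)
```

For this small rotation D stays close to diagonal form; a larger rotation, or one induced by very unequal standard deviations, would move weight into the off-diagonal cells.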
Discussion 

The main point of these examples is to illustrate that factors 
computed on raw scores need to be interpreted with some knowledge 
of the variability of means and variances. The means and variances 
mirror the distribution forms for variables. Particularly with di-
chotomously scored variables, these statistics represent rather di-
rectly interpretable characteristics of distributions, characteristics
such as skewness and range of variability. It is conceivable that in
some problem areas an investigator will regard it as desirable to
allow these characteristics to help determine a factorial solution.
The procedures outlined here indicate ways in which he may ac-
complish this.

Perhaps it is worthwhile to mention in passing that even when
working with an R matrix and thus with scores for which the unit
and origin are the same 1.0 and 0.0 for all variables, different shapes
of distribution affect the intercorrelations and, thus, the factorial so-
lution. When variables are dichotomous, for example, an $r_{jk}$ cor-
relation between two variables can be +1.0 and -1.0 only if the
variances and the means for the variables are the same. Otherwise
either the maximum $r_{jk}$ is less than 1.0 or the minimum is greater
than -1.0. Thus, if variables have approximately the same form of
distribution (i.e., the same mean and variance, as well as third
and fourth moments), the correlation among them will tend to be
higher than if the means and sigmas differ notably. The size of
factor loadings will thus, to some extent, be a function of similarity
in shape of the distributions, even when variables have been stand-
ardized.
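The restriction on correlations between dichotomous variables follows from the limits on the joint proportion of unity scores. The sketch below computes the attainable range of the product-moment correlation from the two proportions alone; the function name and the example proportions are illustrative.

```python
import math

def phi_bounds(pj, pk):
    """Smallest and largest product-moment correlation attainable between
    two dichotomous variables with proportions pj and pk scored unity."""
    denom = math.sqrt(pj * (1 - pj) * pk * (1 - pk))
    highest = (min(pj, pk) - pj * pk) / denom          # most overlap possible
    lowest = (max(0.0, pj + pk - 1.0) - pj * pk) / denom  # least overlap possible
    return lowest, highest

even_split = phi_bounds(0.5, 0.5)    # equal means and variances
unequal = phi_bounds(0.2, 0.5)       # different means and variances
```

With equal splits of .5 the full range from -1.0 to +1.0 is available, whereas proportions of .2 and .5 restrict the correlation to the range from -.5 to +.5.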

At a practical level differences in shapes of distributions may not
be very important. Guilford (1941) showed that if the correlations
among variables are less than .5, then differences in shape of dis-
tribution of dichotomous variables are not likely to have much
effect on the size of the zero-order correlations. However, it must be
remembered that in factor analysis one is, in effect, dealing with
partial correlations and thus rather small influences on zero-order
correlations may have noteworthy effect on the determination of
factors, particularly those which are extracted rather late in the
factoring process.
AN ALTERNATIVE FACTOR ANALYTIC
SOLUTION FOR WECHSLER'S
INTELLIGENCE SCALES


A. B. SILVERSTEIN 
Pacific State Hospital 


For a factor analytic solution to offer maximal promise of pre- 
dictive utility, Peterson (1960, 1965) maintains that it must display 
at least two properties: descriptive efficiency and statistical in- 
variance. He has demonstrated a relationship between these two 
properties in the domain of personality, using data drawn from 
ratings and questionnaires. Over many comparisons of alternative 
solutions, those with only a few broad factors consistently showed 
greater invariance than those with many narrow ones. The purpose 
of the present study was to determine whether this relationship 
also holds in the intellectual domain, as sampled by Wechsler’s 
(1949, 1955, 1967) scales. 

Cohen’s (1957a, 1957b, 1959) factor analytic studies of the stand-
ardization data for seven age groups on the WAIS and the WISC
have become classics in the field. He factored the subtest intercor-
relations for each age group by the complete centroid method,
applied three criteria for the completeness of factor extraction, and
retained five factors (for all but one age group) for graphic rotation
by the method of two-dimensional sections.

Cohen acknowledged that the final factor or two accounted for
only a small proportion of the total variance, but he held that their
consistent appearance made them unquestionably real phenomena.
With equal justification, however, this argument can be reversed:
while his smaller factors may be perfectly real, from the viewpoint
of descriptive efficiency they are trivial. Moreover, although he
claimed that essentially the same factors were found in both the
WAIS and the WISC, much of his discussion deals with differences
in the results for different age groups. It may be that these differences
were due to the emphasis on sufficiency, rather than efficiency, in the
solution.


Method 

The material for the present study was in part the same as that
used by Cohen: the standardization data for age groups 1f
25-34, 45-54, and 60-75+ on the WAIS (11 subtests) and 7.5,
and 13.5 on the WISC (12 subtests). The standardization data for
age groups 4, 4.5, 5, 5.5, 6, and 6.5 on the WPPSI (11 subtests)
provided additional material. These data were obtained from the
manuals, and from Doppelt and Wallace's (1955) account of the
WAIS standardization for older persons. Details of the analysis to
which they were subjected are given in the following section.


Results and Discussion

The subtest intercorrelations for each age group were factored by
the principal-factor method with squared multiple correlations in
the diagonal (Harman, 1967). The average proportions of the total
variance attributable to successive unrotated factors are shown
in Figure 1. The general form of the curves is very similar to that of
the curves presented by Peterson. Their most striking characteristic
is the extreme negative slope between the first and second factors;
after the second factor, the slope is very low. Thus, inspection sug-
gests that it would be inefficient to retain more than two factors.
But there is a more objective answer to the question of the number
of factors to be retained. Kaiser (1960) has demonstrated that
algebraic, psychometric, and psychological criteria are all met by
retaining as many factors as there are latent roots greater than
one, when the intercorrelations are factored with unities in the
diagonal. By this standard, two factors should be retained for
nine of the 13 age groups, and only one factor for the remaining
four. In order to assess the invariance of the solution, however,
two factors were retained for rotation by the maxplane method
(Eber, 1966) for each of the age groups.²
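
The latent-root rule described above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the latent roots are made-up values for an 11-subtest matrix (with unities in the diagonal the roots sum to the number of variables), not figures from the present study.

```python
def kaiser_count(latent_roots):
    """Count the latent roots (eigenvalues) strictly greater than one."""
    return sum(1 for root in latent_roots if root > 1.0)


def variance_proportions(latent_roots):
    """Proportion of total variance attributable to each unrotated factor."""
    total = sum(latent_roots)
    return [root / total for root in latent_roots]


# Hypothetical latent roots for 11 subtests; NOT data from the study.
roots = [5.5, 1.2, 0.8, 0.7, 0.6, 0.5, 0.5, 0.4, 0.3, 0.3, 0.2]

n_factors = kaiser_count(roots)
proportions = variance_proportions(roots)
print(n_factors, [round(p, 2) for p in proportions[:3]])
```

With roots shaped like these, the steep drop after the first root and the flat tail after the second mirror the pattern the text describes.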


2 Apart from the difference in the number of factors retained, this analysis
differed from Cohen’s primarily in the fact that it was carried out by
computer. The method of factoring that he employed was introduced as an ap-
proximation to the method used here, and conversely, the method of rotation
used here was designed to mimic the method that he employed.

Figure 1. Average proportions of total variance attributable to successive
unrotated factors.

The average loadings of the subtests on Factors I and II, i.e., the
elements of the reference structure, are given in Table 1. The sub-
tests with the highest loadings on Factor I are the Verbal subtests,
and those with the highest loadings on Factor II are the
Performance subtests. The average proportions of the total variance
of the Verbal and Performance Scales attributable to the two fac-
tors are given in Table 2. Cohen concluded that the Verbal and
Performance Scales do not constitute “the actual functional unities
in intelligence test performance,” but the present findings are con-
sistent with quite the opposite conclusion.
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TABLE 1 
Average Loadings of Subtests on Factors I and II

                              Factor I               Factor II
Subtest                 WAIS  WISC  WPPSI      WAIS  WISC  WPPSI

Information              66    67    53         04   -03    09
Comprehension            60    57    57         01   -01    01
Arithmetic               51    57    33         11   -01    27
Similarities             55    60    54         09   -01   -02
Digit Span               42    41    —          09    03    —
Vocabulary               71    65    51        -03    02    06
Sentences                —     —     47         —     —     07

Verbal                   58    59    50         05    00    08

Digit Symbol             36    —     —          25    —     —
Picture Completion       32    19    19         34    33    34
Block Design             13    13    09         53    53    46
Picture Arrangement      30    31    —          32    27    —
Object Assembly          08   -02    —          52    62    —
Coding                   —     27    —          —     16    —
Mazes                    —     11    01         —     40    49
Animal House             —     —     12         —     —     37
Geometric Design         —     —     00         —     —     53

Performance              24    17    08         40    40    44


The issue is actually one of factor order, degree of complexity, or
level of analysis, and neither Cohen’s solution nor the present one
can be said to be “right” or “wrong” on mathematical grounds
alone. Factor I in the present solution represents a merger of
Factors A (Verbal Comprehension I), C (Memory), and D (Verbal
Comprehension II) in Cohen’s solution, while Factor II represents a
merger of his Factors B (Perceptual Organization) and E (a
quasi-specific).

The invariance of the two solutions was assessed by determining 
the coefficient of congruence between allegedly matching factors. In 
comparing the loadings from different scales, only those subtests 


TABLE 2
Average Proportions of Total Variance of Verbal and Performance Scales
Attributable to Factors I and II

                      Factor I               Factor II
Scale           WAIS  WISC  WPPSI      WAIS  WISC  WPPSI

Verbal          [illegible]
Performance      30    20    12         55    58   [illegible]

Note.—Joint influence of factors distributed equally between them.

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with the same name were considered common variables, e.g., 
despite their obvious similarity, Digit Symbol in the WAIS and 
Coding in the WISC were not so regarded. The average coefficients 
in the present solution were .97 for Factor I, .91 for Factor II, and 
.94 for all alleged matches. The average coefficients in Cohen's 
solution were .74, .85, .64, .52, and .27 for Factors A through E, 
respectively, and .62 for all alleged matches. 
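
As a rough illustration of the coefficient of congruence used above, the sketch below computes the usual index, the sum of cross-products of two loading vectors divided by the square root of the product of their sums of squares. The function name is hypothetical, and the vectors are the averaged Factor I loadings from Table 1 for the ten subtests that share a name in the WAIS and WISC; the study itself compared individual age groups, so the result is not one of the coefficients reported here.

```python
import math


def congruence(x, y):
    """Coefficient of congruence between two vectors of factor loadings."""
    if len(x) != len(y):
        raise ValueError("loading vectors must cover the same variables")
    cross = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return cross / norm


# Averaged Factor I loadings (Table 1, decimals restored) for subtests with
# the same name in both scales; used only as an illustration.
wais_factor_1 = [0.66, 0.60, 0.51, 0.55, 0.42, 0.71, 0.32, 0.13, 0.30, 0.08]
wisc_factor_1 = [0.67, 0.57, 0.57, 0.60, 0.41, 0.65, 0.19, 0.13, 0.31, -0.02]

print(round(congruence(wais_factor_1, wisc_factor_1), 2))
```

Unlike a correlation coefficient, the index does not subtract the means of the loadings, so it is sensitive to their absolute level as well as their pattern.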

It appears that the relationship between descriptive efficiency and 
statistical invariance holds in the intellectual domain as well as in 
the domain of personality. As Peterson suggests, however, the 
crucial test of alternative factor analytic solutions is their predictive 
utility, and that test remains to be performed.


CLUSTER ANALYSIS OF SEMANTIC
DIFFERENTIAL DATA¹


RAYMOND L. JOHNSON 
American Institutes for Research 
AND 
DONALD D. WALL 
International Business Machines Corporation 


The geometric model upon which the semantic differential mea-
surement technique is based prescribes an approach to the problem
of semantic classification. Sorting words or concepts into categories 
on the basis of semantic equivalence involves three inter-related 
tasks: the definition of words, using an inventory of semantic fea- 
tures; the establishment of classes, for which the semantic features 
are criterial attributes; and the specification of rules for deter- 
mining the class membership of words. A quantitative approach to 
each of these tasks is implicit in the semantic differential tech- 
nique. It provides an inventory of semantic features in the form 
of 7-point scales anchored by pairs of contrasting adjectives. Words 
from a lexical sample are “defined” by rating them on each of these 
scales, and the degree of profile similarity between any pair of 
words is a measure of the extent to which the words are similar in 
meaning. A geometric interpretation of these data specifies that
words be represented as points in a multidimensional semantic space,
and the degree of similarity or synonymy between any two words
may be depicted as the distance between their corresponding points.
When we have computed the distances between all pairs of points,
an examination of the resulting distance matrix reveals the fact
that the points are not evenly scattered throughout semantic space.
Instead, points tend to occur in clusters, and it is this aggregative
tendency which may be exploited for purposes of classification.
Since words which belong to the same cluster are defined by a
similar pattern of semantic features, they may be considered to be
similar in meaning, and hence to belong to the same semantic
category. The problem of semantic classification is thus recast as
the problem of detecting clusters in semantic space, and the solution
yields a set of categories indigenous to the response data, not
imposed from without on subjective or a priori grounds. Table 1
summarizes the geometric model for representing semantic differ-
ential data.

1 This research was supported by The Office of The Surgeon General, U. S.
Army Medical Research and Development Command, under contract DA 49-193-MD-

A classification derived from semantic differential ratings has an
applicability which is necessarily somewhat narrow because its
relevance is limited to those instances when an investigator wishes
to categorize words or concepts on the basis of affective similarity.
This restriction is imposed by the fact that the semantic differential
is considered to be a measure of qualitative meaning in a meta-
phorical or affective sense (Miron and Osgood, 1966). The affective
contents of concepts are often the salient semantic features for
attitude research and clinical studies. One recent application of
cluster analysis to semantic differential data was Hofman’s (1967)
study of teachers’ attitudes toward such emotionally-freighted con-
cepts as “creativity,” “discipline,” and “tradition.” The cluster
analysis procedure described in the present paper was originally
developed as a method to help psychiatrists interpret the delusional
language of psychotics. Patients rated key words from their own
interviews with psychiatrists, and the resultant classification estab-
lished categories for succinctly summarizing the content of the
interviews, and for interpreting the affective meaning of a patient’s

delusional language by comparing it to his non-delusional language
from the same interview. Such a classification of key words or con-
cepts may be the terminal objective of a study, or it may be a pre-
paratory step for further work. An example is the proposed use of
semantic differential data in organizing dictionaries for computer-
oriented content analyses (Stone, Dunphy, Smith, and Ogilvie,
1966).


TABLE 1
Some Correspondences between Geometric and Semantic Space
Adapted from Edmundson (1967)

Geometric Space              Semantic Space

coordinate or axis           semantic dimension
origin                       meaninglessness or neutrality
point                        word or concept
distances between points     semantic similarity of words
cluster of points            semantic class

A knowledgeable use of the semantic differential technique in
attitude research, clinical studies and related areas requires a fa-
miliarity with the measurement characteristics of the instrument.
Two can be cited as being especially pertinent since they may pose
problems for cluster analysis. There are certain differences in re- 
sponse “styles” from individual to individual, and a systematic 
variation in the relative distances between points from region to 
region in semantic space. Both merit further discussion and will be 
considered in turn. 


Response Polarization 


Semantic differential ratings may be subject to a type of response 
bias which complicates attempts to compare the meanings of con- 
cepts from one person to another. Polarity, the extremeness of 
response on a 7-point scale, has been assumed in the early literature 
to reflect intensity of meaning. However, this interpretation now 
must be modified since extensive use of the semantic differential 
has produced an accumulating weight of evidence that individual 
raters differ in their “scale checking styles.” The ratings of some 
people tend to be highly polarized toward the extreme outside inter- 
vals of a bipolar scale and to occur with relative infrequency in 
the intermediate intervals near the midpoint; this response bias is 
especially noticeable in psychiatric patients (Arthur, 1966; Marks, 
1965; Neuringer, 1963; Zax, Gardiner, and Lowy, 1964). Other 
raters, in contrast, are relatively unconstrained in their use of the 
middle positions on the scale. Individual differences in the tendency 
to use the extreme outside intervals can influence interpoint dis-
tances sufficiently to make it difficult to compare one person’s
ratings with another’s.

One solution, suggested by Hofman (1967), is to transform the
distance matrix by dividing each cell entry by the square root of
the sum of the squared distances, so that the squared entries sum
to unity. This normalizing transformation is applied equally to
all distances entered in the matrix and a critical distance value 
(or threshold) is then chosen in order to determine cluster bound- 
aries. There is a drawback to this approach, however. The data may 
be systematically distorted due to the fact that the polarity of a 
word does not vary independently of its meaning, especially the 
evaluative aspect. Howe (1965) found, in analyzing semantic differ- 
ential data obtained from college students, that the mean polarity 
of positively valued, “good” stimulus words was significantly higher 
than the mean for negative words. A similar relationship between 
polarity and evaluative meaning has been found among schizo- 
phrenic patients (Johnson and Miller, 1966). Thus, available evi- 
dence indicates that polarity is not an uncontaminated measure of 
semantic intensity and cannot be “filtered out” or corrected statisti- 
cally without at the same time altering the rated meaning of a con- 
cept. 
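
The normalizing transformation attributed to Hofman above can be sketched as follows. This is a minimal reading of the verbal description, assuming the sum is taken over every cell of the square distance matrix; the function name and the small matrix are hypothetical.

```python
import math


def normalize_distances(matrix):
    """Divide every cell by the square root of the sum of squared cells."""
    total = sum(d * d for row in matrix for d in row)
    if total == 0:
        raise ValueError("all distances are zero; nothing to normalize")
    scale = math.sqrt(total)
    return [[d / scale for d in row] for row in matrix]


# Hypothetical distances among three words (symmetric, zero diagonal).
distances = [
    [0, 3, 4],
    [3, 0, 5],
    [4, 5, 0],
]

normalized = normalize_distances(distances)
print(sum(d * d for row in normalized for d in row))
```

Because every cell is divided by the same constant, the rank order of the distances is unchanged; only the overall scale is removed, which is why the polarity effects discussed above survive the transformation.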
Density Shift 

Another problem in the cluster analysis of semantic differential 
ratings may arise when we try to make intra-rater comparisons. 
An examination of data obtained both from normal and psychiatric
subjects has disclosed what we believe is a general property of 
semantic space: there is a change in the relative density of points 
along the main axis, the evaluative dimension. Points in the positive- 
valued (“good”) region of the space tend to be more numerous and 
more compactly clustered than are points in the sparsely populated 
negative-valued (“bad”) region. This thinning out in density may 
be explained, at least partially, by the fact that in many lexical 
samples, “good” words outnumber “bad” words, often by a margin 
of two-to-one (Johnson, Thomson, and Frincke, 1960). But what- 
ever the reason, this structural characteristic of semantic space 
complicates the task of cluster detection since the criterion for
proximity must be a variable one: distance values which are large 
between points in the positive-valued region might be relatively small 
for those in the negative-valued region. 

These two measurement problems would seem to preclude the 
development of a straightforward fully explicit algorithm for the 
detection of clusters in semantic space. A significant role must be 
reserved for the individual investigator's judgment and experi-
ence. A cluster analysis procedure can be programmed to provide
all the iterative calculations required and even to make tentative
demarcations of cluster boundaries and provisional cluster assign- 
ments of isolated or borderline points. But many of the final 
decisions must remain subjective and somewhat arbitrary. In this 
paper we describe three methods of cluster analysis which, when 
used separately or in conjunction with one another, furnish in a 
coherent and convenient form the information upon which an in- 
vestigator may base his decisions. 


Method 


Thesaurus Construction 


Beginning with semantic differential ratings of each word in a 
lexical sample, we find the distance between every pair of words. 
There is, of course, a choice of distance metrics available. One which
we have found to be convenient is a noneuclidean, city block model,
with provision for incorporating arbitrary weight factors:

$$D_{ij} = \sum_{k=1}^{n} W_k \, |a_{ik} - a_{jk}|$$

where $D_{ij}$ is the distance between point i and point j; $a_{ik}$ is the
assigned scale value of the ith word on the kth scale; $a_{jk}$ is the as-
signed scale value of the jth word on the kth scale; n is the number
of scales; and $W_k$ is a weight factor chosen either to emphasize or
de-emphasize the contribution of the kth scale to the total distance
value. Once all distances are calculated, a thesaurus is compiled
which lists, for each word in turn, all other words in order of in-
creasing distance. These distances, as listed in the thesaurus, are
next used in the detection of clusters. 
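
The weighted city block distance and the thesaurus listing can be sketched as below. The function names, the four words and their three-scale ratings are all hypothetical and serve only to show the bookkeeping; unit weights are assumed when none are supplied.

```python
def city_block(a, b, weights=None):
    """Weighted city block distance between two rating profiles."""
    if weights is None:
        weights = [1.0] * len(a)  # assumption: equal weighting by default
    return sum(w * abs(x - y) for w, x, y in zip(weights, a, b))


def build_thesaurus(ratings, weights=None):
    """For each word, list all other words in order of increasing distance."""
    thesaurus = {}
    for word, profile in ratings.items():
        neighbours = [
            (city_block(profile, other_profile, weights), other_word)
            for other_word, other_profile in ratings.items()
            if other_word != word
        ]
        neighbours.sort()
        thesaurus[word] = [(w, d) for d, w in neighbours]
    return thesaurus


# Hypothetical ratings on three 7-point scales; NOT data from the study.
ratings = {
    "calm": [6, 5, 2],
    "peace": [7, 5, 2],
    "anger": [1, 6, 6],
    "fear": [2, 4, 6],
}

thesaurus = build_thesaurus(ratings)
print(thesaurus["calm"])
```

Each thesaurus entry is a list of (word, distance) pairs, which is the form the later sketches of the histogram and density steps could consume.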


The Distance Histogram Method 


As a graphic means for detecting clusters a distance histogram is
compiled for each word i (using distance calculations listed in the
thesaurus and a suitable choice of class interval) to show the
distribution of distances from word i to all other words j. The
histogram for a word embedded in a cluster will characteristically
exhibit “early peaking” (i.e., an initial modal hump along the
abscissa near the origin), indicating that there is a relatively large
number of points j within a short distance of point i. Other clusters
in the semantic space will appear as peaks farther out along the
abscissa. An example of an early peaking histogram is shown in
Figure 1. To identify the words which correspond to points in a
cluster, we consult the thesaurus for a listing of those words which
fall within the distance between i and the outermost point j of the
cluster, as determined by the upper boundary of the modal hump. 
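
A distance histogram for one word can be tallied as in the sketch below. The function names and the list of distances are hypothetical, and the check for early peaking (the modal class interval is the first one) is one simple operational reading of the description, not a rule given in the text.

```python
def distance_histogram(distances, interval):
    """Frequency of distances in successive class intervals of equal width."""
    if not distances:
        return []
    counts = {}
    for d in distances:
        index = int(d // interval)
        counts[index] = counts.get(index, 0) + 1
    return [counts.get(i, 0) for i in range(max(counts) + 1)]


def peaks_early(histogram):
    """Assumed criterion: the first class interval is the modal one."""
    return bool(histogram) and histogram.index(max(histogram)) == 0


# Hypothetical distances from one word to nine others.
distances_from_word = [1, 2, 2, 3, 4, 11, 12, 13, 25]

histogram = distance_histogram(distances_from_word, interval=5)
print(histogram, peaks_early(histogram))
```

In this made-up case the first interval holds most of the points, and the smaller hump farther out would correspond to a second, more distant cluster.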


The Maximum Density Method


The first step alone may disclose a good many clusters. However,
clusters detected by this method tend to overlap and additional
analysis is needed to isolate disjoint clusters and to delineate their
members.

The second step is to select from all key words with early peaking
histograms those words which are centers of the most dense regions
of the space. The calculation of densities is a more sensitive method
for detecting small clusters in regions of high point-density, whereas
the distance histograms are most useful for detecting larger clusters
in relatively sparse regions. Since semantic space tends to change in
average density from region to region, the two methods comple-
ment each other. When used jointly, the histograms identify the
most dense regions, and the density calculations identify a number
of words within those regions which serve as nuclei of important
clusters.

Figure 1. An early peaking distance histogram for one word. The distri-
bution of distances from one of the words to all others in the lexical sample
(n = 108) is plotted as a histogram with between-word distances along the
abscissa and frequency along the ordinate. The distribution exhibits early
peaking, with a larger mean (24.81) than median (18.00), indicating that the
word is centrally located within a cluster. The data is taken from a cluster
analysis of key words extracted from the text of a psychiatric interview.

However, the calculation of density involves a theoretical diffi- 
culty. Intuitively, the density of a region is the number of words 
(or points) in the region divided by the volume of the region. But 
what is an appropriate measure of volume in semantic space? 

The n scales of the semantic differential determine an n-dimen-
sional space, but redundancy and dependence among the scales
will generally permit points which represent actual data to be 
effectively represented in a space of some smaller number of dimen- 
sions, q. The value of q may be determined by a factor analysis, or 
may be chosen to reflect a priori considerations. Most research with 
the semantic differential has found that semantic space can be 
effectively represented in two or three dimensions, provided the 
concepts represent a heterogeneous sampling (Ervin-Tripp and Slo- 
bin, 1966). 

To calculate densities, the volume of a region in semantic space
which extends a distance, D, from any point i (i.e., a sphere of q
dimensions with radius D) is taken to be proportional to $D^q$. With
point i as a center-point and the distance to each point j serving in
turn as a radius, a density is computed using the formula:

$$G_i = P_{ij} / D_{ij}^{\,q}$$

where

$G_i$ = a measure of the density of the cluster of points surrounding
point i

$P_{ij}$ = the number of points within distance $D_{ij}$ of point i, including
point i itself

$D_{ij}$ = the distance from the center of the cluster, i, to point j
(i.e., the radius of the cluster)

q = number of degrees of freedom, corresponding to the dimen-
sionality of the space.

The distance, $D_{ij}$, which produces the maximum value of $G_i$
determines the cluster of maximum density. Under normal circum-
stances the cluster of maximum density about point i will involve a
relatively small distance and only a fraction of the total number of
points. 
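
The search for the radius of maximum density about one word can be sketched as follows, taking the density to be the number of points within the radius (the center included) divided by the radius raised to the power q. The function name and the neighbour list are hypothetical; neighbours at zero distance are skipped here because the ratio is undefined for them, which is an assumption the text does not address.

```python
def max_density_cluster(neighbours, q):
    """Return (density, radius, members) for the densest sphere about a word.

    neighbours is a list of (word, distance) pairs for the center word,
    as in one thesaurus entry; q is the assumed dimensionality.
    """
    best = None
    for _, radius in neighbours:
        if radius <= 0:
            continue  # assumption: coincident points cannot serve as a radius
        members = [word for word, d in neighbours if d <= radius]
        density = (len(members) + 1) / radius ** q  # +1 counts the center
        if best is None or density > best[0]:
            best = (density, radius, members)
    return best


# Hypothetical thesaurus entry for one center word; q = 2 is assumed.
entry = [("a", 4), ("b", 4.5), ("c", 5), ("d", 20)]

density, radius, members = max_density_cluster(entry, q=2)
print(radius, members)
```

Here the three near neighbours form the densest region and the remote fourth word is excluded, which is the behavior the paragraph above describes for the cluster of maximum density.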

Those points j which are found in the most dense region about i
constitute a cluster with i as its center. When two or more clusters
are near to each other in the space the clusters may be found to 
share many of the same points. If there is an extensive overlapping 
in the membership of clusters, we merge the overlapping sets of 
points into a single cluster. The decision whether or not to merge 
clusters which have partial overlap in membership is, to some extent, 
arbitrary and depends on the amount of cluster definition re- 
quired by the particular classification problem. Tentative cluster 
boundaries are drawn to include those points which are most densely 
packed about a center point and to exclude more remote points 
(e.g., those situated in border areas between clusters).


A Probabilistic Approach 


Either method of cluster detection, described thus far, can provide 
the basis for a useful, if somewhat gross, classification of words or 
concepts, with a minimum of restrictions or limiting mathematical 
assumptions. But certain questions about the classification may 
remain unanswered. For example, to what class does one appro- 
priately assign a point which is isolated in semantic space, or which 
is located in an area where clusters overlap? If the specific classifica- 
tion problem requires that cluster boundaries be more carefully de- 
termined, a possible approach is to compute the probability that a 
point belongs to a cluster. The method involves measuring the dis- 
tance from a point to the center of gravity of the cluster and next 
finding the probability of obtaining a distance greater than or equal 
to the measured distance. The result is a probabilistic metric, and the 
assignment of isolated points to clusters can then be made on a basis 
of greater probability of cluster membership. After a cluster has 
been detected, using distance histograms and density calculations, 
the attribute data for member words from the semantic differential
is expressed as data matrix A = ($a_{ij}$), with the number of rows
(r) equal to the number of points in the tentative cluster and the
number of columns (n) equal to the number of scales on the
semantic differential. The original data matrix A is modified by
subtracting the mean ($m_j$) of column j from each element in column
j to form matrix B = ($b_{ij}$), in which each column has a mean of
zero (i.e., the origin of the space is moved to the center of gravity
of the data points). Matrix B is next multiplied by the transpose
of B to form the square matrix, $W = B^{T}B$.
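
The centering step and the formation of W can be sketched without any matrix library. The function names and the four-point, two-scale data matrix are hypothetical; W is taken to be the n-by-n product of the transpose of B with B, as the text states.

```python
def center_columns(a):
    """Subtract each column mean so that every column of B has mean zero."""
    r, n = len(a), len(a[0])
    means = [sum(row[j] for row in a) / r for j in range(n)]
    return [[row[j] - means[j] for j in range(n)] for row in a]


def cross_product(b):
    """Form the square matrix W = B'B (n by n sums of squares and products)."""
    n = len(b[0])
    return [
        [sum(row[j] * row[k] for row in b) for k in range(n)]
        for j in range(n)
    ]


# Hypothetical cluster of r = 4 words rated on n = 2 scales.
A = [[5, 3], [1, 3], [3, 4], [3, 2]]

B = center_columns(A)
W = cross_product(B)
print(B, W)
```

In this contrived example the two scales are uncorrelated within the cluster, so W is diagonal and its diagonal entries are themselves the eigenvalues.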

The eigenvalues ($\lambda_j$) and unit eigenvectors ($U_j$) for W are now
determined; the $\lambda_j$ will be non-negative and the $U_j$ will be mutually
orthogonal. Small $\lambda_j$ will correspond to directions in which the data
have small variation and may be ignored. The number (q) of the
$\lambda_j$ that are considered to be significantly large is the effective
dimension of the space occupied by the cluster points. The q eigen-
vectors ($U_j$) corresponding to large $\lambda_j$ furnish a set of coordinate
axes for representing the points in a space of q dimensions.

We next find the coordinates of the cluster points along the eigen-
vector $U_j$ by computing vector dot products:

$$z_{ij} = \sum_{k=1}^{n} b_{ik} U_{kj}$$

where

$z_{ij}$ = the coordinate of point i along vector $U_j$
$b_{ik}$ = the entry in matrix B = ($b_{ik}$) for row i and column k
$U_{kj}$ = the kth component of vector $U_j$

If the distribution of these coordinates along each of the q selected
vectors $U_j$ may be assumed to be normal, and if different distribu-
tions may be assumed to be independent, then it is possible to
compute the probability that any point i belongs to the cluster. A
measure of distance, $c_i^2$, of a point i with coordinates $z_{ij}$ along
$U_j$ from the center of gravity of the cluster is

$$c_i^2 = (r - 1) \sum_{j=1}^{q} z_{ij}^2 / \lambda_j$$

As an alternative to examining the distributions of $z_{ij}$ along
$U_j$, in order to justify using $c_i^2$ as just described, the distribution of
the values of $c_i^2$ for points in a cluster may be compared to the chi-
square function with q degrees of freedom, to see if this fit is
acceptably good.
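
The distance measure can be sketched as below, assuming it takes the form c² = (r − 1) Σ z²/λ, summed over the q retained eigenvectors, with each coordinate z obtained as the dot product of the point's centered ratings with a unit eigenvector. The function names, eigenvectors and eigenvalues are hypothetical and are supplied by hand rather than computed.

```python
def coordinates(b_row, eigenvectors):
    """Coordinates of one centered point along each retained eigenvector."""
    return [sum(b * u for b, u in zip(b_row, vec)) for vec in eigenvectors]


def c_squared(b_row, eigenvectors, eigenvalues, r):
    """Assumed form: (r - 1) times the sum of z**2 / eigenvalue over q axes."""
    z = coordinates(b_row, eigenvectors)
    return (r - 1) * sum(zj * zj / lam for zj, lam in zip(z, eigenvalues))


# Hypothetical cluster of r = 4 points whose W matrix is diagonal, so the
# unit eigenvectors are the scale axes and the eigenvalues are 8 and 2.
eigenvectors = [[1, 0], [0, 1]]
eigenvalues = [8, 2]

print(c_squared([2, 0], eigenvectors, eigenvalues, r=4))
print(c_squared([1, 1], eigenvectors, eigenvalues, r=4))
```

Dividing each eigenvalue by r − 1 gives the variance of the cluster along that axis, so under this reading c² is a sum of squared standardized coordinates, which is what makes the comparison with chi-square plausible.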

Under the assumptions stated above, the variable $c_i^2$ has a chi-
square distribution with q degrees of freedom. This is the probabi-
listic metric. To decide whether an isolated point should be as-
signed to a cluster, the quantity $c_i^2$ is computed relative to that
cluster and compared with the appropriate value in a table of the
chi-square function with q degrees of freedom. The tabled value
gives the probability that the point belongs to the cluster and yet
has a distance from the cluster's center of gravity as great as $c_i^2$.
If a point is located in overlapping boundary areas between two or
more clusters, the smaller value of $c_i^2$ computed for that point de-
termines best assignment.

In Figure 2 are shown the distributions of $c_i^2$ values for a cluster,
computed for two values of q. Two chi-square functions with corre-
sponding degrees of freedom are plotted for comparison. The curves
are seen to be sufficiently close to permit us to accept the chi-square
functions as a basis for determining the probability that any point
in the space is a member of this cluster. Specifically, the probability
that any given point belongs to the cluster may be estimated by
comparing its value of $c_i^2$ with the chi-square function. For example,
a point with a $c_i^2$ value of 1.00 (df = 3) has a .80 probability of
being a member of the cluster, whereas a point with a $c_i^2$ value of
5.00 has a probability of less than .20 of belonging.
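
The worked example can be checked without a printed table. The sketch below evaluates the upper-tail probability of chi-square for a whole number of degrees of freedom, using the standard finite series for even and odd degrees of freedom; the function name is hypothetical.

```python
import math


def chi_square_tail(x, df):
    """Probability that chi-square with integer df is at least x."""
    if x < 0 or df < 1 or int(df) != df:
        raise ValueError("x must be non-negative and df a positive integer")
    if x == 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # exp(-x/2) * sum of (x/2)**r / r! for r = 0 .. df/2 - 1
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= half / r
            total += term
        return math.exp(-half) * total
    # Odd df: normal tail plus a finite series in odd powers of sqrt(x).
    chi = math.sqrt(x)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    density = math.exp(-half) / math.sqrt(2.0 * math.pi)
    return math.erfc(chi / math.sqrt(2.0)) + 2.0 * density * total


print(round(chi_square_tail(1.00, 3), 2))
print(round(chi_square_tail(5.00, 3), 2))
```

For three degrees of freedom the two values agree with the figures quoted in the example: about .80 for a value of 1.00 and somewhat under .20 for a value of 5.00.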
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Cluster Interpretation 


The final step following the detection of clusters is to interpret 
them. For a cluster to be interpretable, a joint consideration of its 
lexical composition and pattern of adjectival attributes must sug- 
gest a suitable characterization. Cluster interpretation is thus an 
exercise in inductive inference (similar to factor naming) and the 
result is several sense-characterizations which state as explicitly 
as possible the nature of the synonymy underlying each cluster. 
Because most variation in the data is usually along the “good-bad” 
axis, clusters and their corresponding semantic categories differ 
mostly in evaluative meaning. And since the positive-valued region
of the space is densely populated, clusters here tend to be larger, 
more amorphous in composition, and hence somewhat more difficult 
to interpret than are clusters in the sparsely populated negative 
region. 


Summary 


The cluster analysis of semantic differential data offers one ap- 
proach to the problem of semantic classification when the objective 
is to categorize words or concepts on the basis of similarity in 
affective meaning. This paper discusses some of the characteristics 
of semantic differential data which pose problems for cluster an-
alyses, and describes three relatively simple methods of cluster
detection: (a) the construction of histograms which show the dis-
tributions of distances among points in semantic space, (b) the
determination of point density within regions of the space, and (c) 
the estimation of the probability that a given point belongs in a 
particular cluster.


No psychologist can be content—as can the mathematician—with
the outcome of factoring a single correlation matrix from a single
experiment. If restrictions have been applied—such as simple
structure (Thurstone, 1940) or confactor rotation (Cattell, 1966)—
to the purely mathematical rotation, the scientific model requires
that the factors represent some influence or source trait which
should re-appear in any other experiment independently brought
by the same rules to its unique resolution. Consequently, the next
step in any factor analytic research, after obtaining the unique
rotation and checking its statistical significance (Bargmann, 1953),
is to ascertain the goodness of match with the results of whatever
coordinated researches have been strategically planned.

Two situations are commonly met in factor matching: (1) differ-
ent subjects with the same variables and (2) the same subjects with
different variables. The same subjects with the same variables (on
two occasions) can be matched by direct correlation, though, of
course, the methods we have described are still applicable. Direct
factor correlations can also be worked out between putatively match-
ing factors with the same subjects and different variables. With
different subjects and different variables the problem belongs to 
Alice in Wonderland. Our concern here is therefore with (1) above, 
covering situations in which the same variables are used in different 
experiments and samples and matching must be via the variables. 
Parenthetically, the percentage of studies in psychology where 
enough common hypothetical markers are carried to permit reliable 
estimation of matching is scandalously small, those surveyed by 
Hundleby, Pawlik, and Cattell (1965) being rare exceptions. 

Three distinct principles have been proposed for evaluating good-
ness of match by variables. (All, as Cattell has pointed out [1962],
should be applied to the factor loadings of variables in the factor
pattern matrix, $V_{fp}$, not to the correlations in the structure matrix
$V_{fs}$ or the weights in the factor estimation matrix $V_{fe}$.) The first
principle, which was proposed, with formulae, by Cattell (1949) 
depends on distinguishing categorically between salient variables, 
significantly and highly loaded, and nonloaded, hyperplane vari- 
ables. It then estimates the significance of a certain number of 
salients appearing in common to two factors (or collections of 
factors) when a given number of variables are common to the two 
researches. The second utilizes the numerical magnitudes of the 
loadings and, since a correlation coefficient would be inappropriate 
for this, employs Burt’s congruence coefficient. It has been utilized 
by the present writers and by Wrigley and Neuhaus (1955), and, 
since the distribution is not known, Schneewind and Cattell (In 
press) have generated such a distribution on the basis of empirical
objective test data. The third or configurative matching method 
aims to operate as if one were determining the correlation between 
two factors on the same subjects and therefore in the same space, 
but, since this space does not exist, it creates an ersatz common 
space by aligning the two test configurations with a least squares 
fit and determining the correlation (cosine) between the putatively 
matched factors from each (Cattell, 1965). 

On account of the special assumptions in each of these methods— 
and particularly of the capriciousness which our experience shows 
in the results from the congruence coefficient, which is the easiest 
and therefore most used index—we would suggest that the best 
work in this area should simultaneously apply two or three of the 
evaluations. In evaluating such evidence we should, moreover, con-
sider such arguments and evidence as have been put forward by
Cliff (1966), Tucker (1959), Wrigley and Neuhaus (1955), and
the present writers elsewhere (Cattell, 1966; Horn and Morrison,
1965; Nesselroade, 1965).


Theoretical Arguments for the s Index 


A preference for the approach by salients—the s index—which 
we are here discussing has a theoretical justification in what has 
been called the factor mandate matrix (Cattell, 1966). This states,
by 1’s and 0’s in a matrix, that the nature of a factor is such that it
either acts on a variable fully (in a linear relation) or does not act
at all. The various loadings between zero and one (or more) that
we get in an actual factor analysis, in lieu of the ones and zeros in
the factor mandate matrix, express how much the full effect (where
a factor actually operates on a variable) is cut down by the
presence of other factors which operate on the same variable and
share the production of its variance. Even so, if hyperplanes had
no “blur,” the 0’s in the factor mandate matrix would be exact 0’s
in the $V_{fp}$ from any given experiment, and all other values, no
matter how small, would be salients, i.e., factor-affected variables.
Originally, the term salients has, admittedly, meant the variables
which load a factor with conspicuously large magnitude, but in
principle the salient variable similarity, s, should take all variables
significantly out of the hyperplane.

Conventionally, what is in the hyperplane has been arbitrarily
taken at ±.10. Elsewhere, Cattell and Ford (in press) have suggested
an advance on this arbitrariness, consisting of altering the width
with various experimental parameters. Even if we call all loadings
zero which fall in the correctly estimated, say ±2.5 standard error
of a zero loading, we should still not be correctly sorting out
according to the true distinction between salients and nonsalients,
identical with that in the factor mandate matrix. For some very low
real loadings will be taken into the nonsalients, but the percentage
of variables misclassified should be very small. Granted that this
error is small, the salient variable similarity index, s, approach
should in principle be vastly superior to the congruence coefficient,
r_c. For the latter is "deceived" and reduced by all the shifts in
loading magnitude of a particular variable on a particular factor
due to the variance contribution magnitude of the other factors in
the studies varying in their magnitude.

Despite this appeal of the s index it has progressed only slowly
since its initial proposal in 1949. It presents several statistical
difficulties, both for the single factor and the total research match,
one of which is that it looks like a chi square problem but is not,
except in a more generalized sense than the usual one.¹ Some progress
on these was made by Cattell and Baggaley (1960) but certain
problems connected with sign of loading have remained unsolved
until the present article.


Derivation of the Statistic 


As stated above it is desirable to start with the values—factor 
loadings—in the factor pattern matrix and to divide them for each 
of the two factors to be compared into hyperplane non-salients, 
positive salients, and negative salients. 

Two factors would now be maximally similar when, for the
common variables of the two, cross-classification like that shown
in Figure 1 fills only the main diagonal cells. In this case, there is
agreement among salients with positive sign, among salients with
negative sign, and among hyperplanes. Falling in the cell defined
by the PS row and the NS column in the matrix of Figure 1 is a
more serious departure from alignment than is a nonmatch falling
in the cell defined by the PS row and the H column. In fact, where
the latter is one step from alignment this is two steps, for the
variable is now a positive salient in one factor and a negative
salient in the other. According to the principle of simple structure,
variables in the hyperplane of a factor have only random relationship
to the factor. Nonsalients with negative sign should occur with
about the same frequency as nonsalients with positive sign. Thus
the sign of a nonsalient does not indicate the extent of disagreement:
all hyperplane variables are in the same class; any departure from
the center of the H row and column is only one step away from
agreement.


                         Factor 2
                    PS      H       NS

Factor 1    PS      f_11    f_12    f_13    n_1.
            H       f_21    f_22    f_23    n_2.
            NS      f_31    f_32    f_33    n_3.

                    n_.1    n_.2    n_.3    n


PS = positive salient variable
H = hyperplane variable
NS = negative salient variable

f_ij is a joint frequency. The same n variables in both researches.


Figure 1. Schematic Representation of Cross-classification of the Common
Variables of Two Factors.


¹ At the July 1969 Oxford meeting of the Society of Multivariate
Experimental Psychology Jacob Cohen related his K statistic to a weighted chi
square which appears to function very similarly to the present s index, though
apparently not giving an identical distribution to that here obtained
empirically.

On this basis a modified s statistic expressing the degree of
similarity of patterns of loadings is here introduced. Comparing the
actual counts f_ij with chance expectations e_ij, the latter calculated
from marginal frequencies of the f_ij matrix (e_ij = n_i. n_.j / n, as in
a contingency table), seems a logical procedure.

The numerator of the following formula does this for the four
corner cells, two in the main diagonal corners representing a "good
match," two in the off-diagonal positions representing a "bad
match": Actual counts above chance represent a positive value,
those below chance a negative value for the "good match" cells of
salients; the opposite goes for the "bad match" cells. The
hyperplane-salient cells enter the calculation via expected counts (and
via the denominator); the f_22 cell enters by "drawing" variables
from the salient-hyperplane cells, thereby giving greater weight to
the four corner f_ij counts against their expectations. Multiplication
by two, and division by the denominator fix the s value at +1 for
a perfect match (i.e., maximum similarity), zero for chance (i.e.,
f_ij = e_ij), and −1 for maximum dissimilarity. The latter
represents, of course, also a perfect match of the two factors, only with
reversed signs of loadings in one of them. (Reflecting one of the
factors simply "rotates" the matrix of observed counts by 90°, the
"good match" corner cells thus becoming "bad match," and vice
versa.)

$$
s = \frac{2\{[(f_{11} - e_{11}) + (f_{33} - e_{33})] - [(f_{13} - e_{13}) + (f_{31} - e_{31})]\}}{n_{1.} + n_{3.} + n_{.1} + n_{.3}} \tag{1}
$$

At this point one more consideration has to be introduced. In
the f_ij matrix, according to the logic thus far followed, certain
different cells should have exactly the same meaning as to their
contribution to the similarity of two factors. That is, f_11 and f_33
have the same meaning, as do f_13 and f_31; and also f_12, f_21, f_23 and
f_32. A variable which appears in f_33 instead of f_11 should not change
the measure of similarity. However, this is not the case in the above
formula when calculated on the "raw" observed counts. Due to the
use of expected values, we get different results for different
distributions of observed variables in the f_ij matrix even in the case
where the differences are encountered only within the
above-mentioned groups of cells (i.e., when the degree of similarity should
not be affected). It can be seen that maximum values of s are
found when the distribution of counts in f_ij is symmetric, i.e., when
the number of positive salients in each of the factors is equal to
that of negative salients. This unequivocal case has been chosen
therefore to express the measure of the match; a "balanced" f_ij
matrix is to be obtained prior to the calculation of e_ij and of s by
a simple transformation on f_ij, namely, by averaging the counts
between pairs of cells with the same meaning:


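$$
f'_{11} = f'_{33} = \frac{f_{11} + f_{33}}{2}, \qquad
f'_{13} = f'_{31} = \frac{f_{13} + f_{31}}{2},
$$

$$
f'_{12} = f'_{21} = f'_{23} = f'_{32} = \frac{f_{12} + f_{21} + f_{23} + f_{32}}{4} \tag{2}
$$

(f'_22 = f_22, of course)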
This balancing yields the desired equalities n_1. = n_3., n_.1 = n_.3
without affecting the logical contribution of any variable. This is
perfectly consistent with the logic of the salient variable theory in
that match of two negative salients counts the same as a match of
two positive salients, etc. The fact that we may end up with halves
of a variable in some cells after this operation does not have to
puzzle the reader: the essential meaning is not distorted, and the
numerical balance between salients is attained.
When this "balancing" is applied, the above formula may be
written in a somewhat simpler form (the primes are omitted for
simplicity):


$$
s = \frac{f_{11} + f_{33} - f_{13} - f_{31}}{f_{11} + f_{33} + f_{13} + f_{31} + \tfrac{1}{2}(f_{12} + f_{21} + f_{23} + f_{32})} \tag{3}
$$

This corresponds to the original form under the condition that
n_1. = n_3., n_.1 = n_.3.
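
The balancing step and the two forms of the index can be checked numerically. The following is a minimal sketch, not the authors' program: the function names are hypothetical, the 3 x 3 count matrix is ordered PS, H, NS for both factors, and the chance expectation is assumed to be the usual contingency-table value e_ij = n_i. n_.j / n. Exact fractions are used so that half-variables left by the balancing cause no rounding.

```python
from fractions import Fraction


def balance(f):
    """Average the cells that carry the same meaning (the balancing step).

    f is a 3x3 list of counts; rows are factor 1 (PS, H, NS) and
    columns are factor 2 (PS, H, NS).
    """
    good = Fraction(f[0][0] + f[2][2], 2)                      # f11, f33
    bad = Fraction(f[0][2] + f[2][0], 2)                       # f13, f31
    edge = Fraction(f[0][1] + f[1][0] + f[1][2] + f[2][1], 4)  # salient vs. hyperplane
    return [
        [good, edge, bad],
        [edge, Fraction(f[1][1]), edge],
        [bad, edge, good],
    ]


def s_index(f):
    """Simplified form of s, computed on the balanced matrix."""
    b = balance(f)
    num = b[0][0] + b[2][2] - b[0][2] - b[2][0]
    den = (b[0][0] + b[2][2] + b[0][2] + b[2][0]
           + Fraction(1, 2) * (b[0][1] + b[1][0] + b[1][2] + b[2][1]))
    if den == 0:
        raise ValueError("s is undefined when neither factor has salients")
    return num / den


def s_index_general(f):
    """General form of s with expected counts, computed on the balanced matrix.

    Assumes e_ij = (row total) * (column total) / n.
    """
    b = balance(f)
    row = [sum(r) for r in b]
    col = [sum(b[i][j] for i in range(3)) for j in range(3)]
    n = sum(row)

    def e(i, j):
        return row[i] * col[j] / n

    num = 2 * (((b[0][0] - e(0, 0)) + (b[2][2] - e(2, 2)))
               - ((b[0][2] - e(0, 2)) + (b[2][0] - e(2, 0))))
    den = row[0] + row[2] + col[0] + col[2]
    if den == 0:
        raise ValueError("s is undefined when neither factor has salients")
    return num / den


# Hypothetical cross-classification of 20 common variables.
counts = [[4, 1, 0],
          [1, 12, 0],
          [0, 1, 1]]
print(s_index(counts), s_index_general(counts))
```

After balancing, the four corner expectations are equal and cancel in the numerator, which is why the two functions agree on any input.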

The above formula defines a monotonic function along the simi- 
larity dimension we want to assess. The most important question 
now arising is that of significance of the s-values. This is con- 
sidered next. 


Procedures 


Significance of an s-value is conceived of as a departure from 
the probability of the value arrived at in the case of a purely chance 
relationship between the pattern of loadings in factor 1 and that in 
factor 2. The number of loadings n is always the same for a pair 
of factors compared; the number of salients (or the hyperplane 
count) may be different. However, in this article, the s functions 
are worked out only for the cases of equal hyperplane counts in 
both factors. (The solution in cases of different salient/hyperplane 
ratios in the factors can be estimated by taking the average of their 
hyperplane counts as a guide, until probability values of s for 
unequal counts are estimated.) 

The chance distribution for a given number of variables and a 
given proportion of hyperplane loadings is thought of as follows: 

For any pattern of loadings in factor 1, all possible permutations
of loadings in factor 2 are equally likely to occur. For n variables,
there are n! such permutations; working out the s values for all
of them would provide us with a complete chance solution. Although
this solution is simple in principle, the n!'s represent astronomic
numbers for all but very small n's. Even if we reduce the number of
permutations by taking into account the fact that permutations
occurring within only one of the three categories of loadings do not
change the pattern, we still have to calculate

$$
\frac{n!}{h!\,\left(\frac{n-h}{2}\right)!\,\left(\frac{n-h}{2}\right)!} \tag{4}
$$

(h being the number of hyperplane loadings)
values representing the complete chance distribution. Such a number
again makes it impractical to obtain a complete solution when
n approaches 20 or more.
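
To give a feel for the size of these numbers, the count of distinct patterns can be computed directly. This is a small sketch under one assumption: the pattern is described by h hyperplane loadings, p positive salients and q negative salients (h + p + q = n), so that the count is the multinomial coefficient n!/(h! p! q!). The function name is hypothetical.

```python
from math import factorial


def distinct_patterns(n, h, p, q):
    """Number of distinct arrangements of h hyperplane loadings,
    p positive salients and q negative salients over n variables."""
    if h + p + q != n:
        raise ValueError("h + p + q must equal n")
    return factorial(n) // (factorial(h) * factorial(p) * factorial(q))


# 60% hyperplane count, salients split evenly between the two signs.
print(distinct_patterns(10, 6, 2, 2))    # 1260
print(distinct_patterns(20, 12, 4, 4))   # 8817900
print(distinct_patterns(50, 30, 10, 10))
```

Even with the reduction, the count grows from about a thousand patterns at n = 10 to several million at n = 20, which is the practical limit the text refers to.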
TABLE 1
Probabilities s > v_s for Hyperplane Count 60%


* Probability of any v_s such that −1 < v_s < +1 is never zero; because of
rounding the probability values to the nearest thousandth of the sample,
probabilities < .0005 are indicated as zero.


Therefore, a decision was made to estimate the probability values
on the basis of random sampling from the total population of
possible patterns of loadings. A program was written to:
(1) generate (on the basis of random numbers) a pair of loading
patterns for a given number of variables and a given
proportion of hyperplane loadings,
(2) calculate the s-value for this pair,
(3) repeat this operation any specified number of times (i.e., N),
(4) group the s values into class-intervals of .01, and
(5) plot the cumulative frequency distributions on these results.
Because the direction of the signs of the set of loadings in a
factor is completely arbitrary and does not affect the absolute value
of s, the distributions have been built as perfectly symmetric in
steps (4) and (5) above by using absolute values of the generated
indices to estimate probability values for both the positive and
the negative tails of the distributions.
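
The five steps can be sketched in a few lines. This is an illustration only, not a reconstruction of the original program: it assumes that each salient receives its sign at random with equal probability, that s is computed on the balanced counts, and that the one-tailed probability is taken as half of the proportion of |s| values at or above a given level. All function and parameter names are hypothetical.

```python
import random
from collections import Counter


def random_pattern(n, h, rng):
    """One loading pattern: h hyperplane entries (0) and n - h salients (+1 or -1)."""
    pattern = [0] * h + [rng.choice((1, -1)) for _ in range(n - h)]
    rng.shuffle(pattern)
    return pattern


def s_from_patterns(a, b):
    """s for two patterns, using the simplified (balanced) form."""
    good = sum(1 for x, y in zip(a, b) if x * y == 1)             # f11 + f33
    bad = sum(1 for x, y in zip(a, b) if x * y == -1)             # f13 + f31
    edge = sum(1 for x, y in zip(a, b) if (x == 0) != (y == 0))   # salient vs. hyperplane
    den = good + bad + 0.5 * edge
    return (good - bad) / den if den else 0.0


def chance_tail_probabilities(n, h, trials, seed=0):
    """Estimate P(s >= v) for each observed level v, symmetrized over sign."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        s = s_from_patterns(random_pattern(n, h, rng), random_pattern(n, h, rng))
        counts[round(abs(s), 2)] += 1        # class intervals of .01
    tail, running = {}, 0
    for v in sorted(counts, reverse=True):   # accumulate from the extreme tail inward
        running += counts[v]
        tail[v] = running / (2 * trials)     # half of the two-sided proportion
    return tail


tail = chance_tail_probabilities(n=20, h=12, trials=5000, seed=1)
for v in sorted(tail, reverse=True):
    print(f"{v:.2f}  {tail[v]:.3f}")
```

Because only a few count combinations are possible for small n, the printed levels are sparse, which is the staircase shape discussed in the text.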
Various sizes of N, extending from 250 up to 10,000 were tried 


out, to find a level where the frequencies of s values for the given 
n and hyperplane percentage became stable. For N = 5,000 the 
distribution attained a definite shape that did not change noticeably 
with increasing Ns. Therefore, the tables are based on samples of 
this size. 
Tables 1, 2, 3, and 4 give the cumulative probabilities of possible
values of s relative to the tabled values, indicated as v_s, for n's
extending from 10 to 100, and for hyperplane counts of 60, 70, 80, and
90% of n. The same probabilities apply to both the positive and the
negative parts of the distribution.
These facts led to the development of another unusual arrangement
of probability values in the tables. (This development arises
also from the need to provide precision rather than adhere to
statistical custom.)
A staircase-like cumulative distribution of s results from the
fact that the counts of variables in the joint occurrence matrix
f_ij necessarily are discrete values. Consequently, the values of the
s function have this shape. One may notice that for small n's and
greater hyperplane count proportions there are only a few values

TABLE 2
Probabilities s > v_s for Hyperplane Count 70%


E w: e7 .34 01 00 


pi .002 .052 .316 .500 
E ow .er .51 94 UT OOD 


p: .000 .002 .027 .135 .357 .500 


D vw: so .45 .34 .23 .12/ .00. .00 
— P:  .000 .002 .016 .064 .190 .383 .500 


4 y. 00 
i .51 42. .834 26 .17 09 OL 
: P:  .000 .002 .007 .034 .098 .222 .403 .500 


D 
Poo: a 1 Sd T ime nO Ot) 
— P:  .000 [001 004 .018 1052 .129 .247 .407 500 
P o: 01 -00 
175.89 94 .28 28. O 4 
?: 1000 [002 .007 .025 .066 .142 .262 .415 .500 
05 .01 .00 


.38 .94 ..30 .20 .21 17. «38 
P: —.000 .001 .002 .007 .019 .046 .092 a 1283 .425 .500 
o7 .04 .01 .00 


, v: 3 S 5 EU 
| 34 31. .27 .24 21 r^ 7009 .125 .207 .318 .438 .500 


-— P:  .000 .001 .002 .006 .016 


TABLE 3
Probabilities s > v_s for Hyperplane Count 80%


n- 
10 v. .51 01 00 
p .012  .187  .500 
20 v. .76 51  .20 01  .00 
p .000 .003  .041 .279  .500 
30 va: .67 (hee bo) Si Wg 001  .00 
P .000 .001  .011 .083 .316  .500 [ 
40 DE .51 38 t.20 13 .01  .00 
p .000 .003 .024 .115  .347  .500 
50 v: AKAA ANAE: I AEL 01 00 
p .000 .001 .006 .036  .142 .361 .500 
60 v. .42 86 .:298 17.00  .01 00 
P: .000 .002 .012 .052 .164 .369  .500 
vs REBR A A 19 Bo 0] 
P .000 .001 .004 .022 .076 .199  .391 
100 vu: BIDUO .1 .0  .0 
p: 000  .002 .010 .036 .105 .220  .402 
TABLE 4 
Probabilities s > v_s for Hyperplane Count 90%
n= 
10 v: .01 .00. 
p: .052 .500 
20 D .51 .01 .00 
p: .003 .099 .500 
30 v: .67 .34 .01 .00 
P .000 .007 -133 .500 
40 v: .51 .26 .01 .00 
p: .000 .012 .107 .500 
50 v. ES 21 01 .00 
p: .000 018 198 .500 
60 n: ¿51 .34 17 .01 
i .000 .002 .029 .217 5 
80 wo d .88 .26 13 .01 
p: -000 .004 045 .251 500 
100 De .31 21 1 .01 00 
p: .000 .007 061 .286 5

s can possibly acquire; their number increases with n and with
decreasing disproportionality among the three categories of
loadings.

Users of this first set of tables will face a few problems. Inter- 
polation of probability values for n’s between those provided in the 
tables should not be too difficult; there is a regularity in “shift” of 
the probabilities with changing n. Similar regularity may be noticed 
also when moving from one level of hyperplane percentage to an- 
other. So far the two preceding kinds of interpolation are left to 
the intelligent estimation of the researcher (of course, one can 
always “be sure” by simply using the more conservative of the two 
tabulated values between which the actual parameter lies). More 
difficult will be a case in which the proportions of hyperplane 
variables are considerably different for the two factors; averaging 
them, as suggested earlier, may not be a correct procedure. Answers 
to these questions will need to await development of new sets of 
tables for additional levels of parameters and their combinations, 
or the derivation of analytic solutions.


PLOTTING ANOVA INTERACTIONS FOR EASE
OF VISUAL INTERPRETATION


JULIAN C. STANLEY
The Johns Hopkins University¹


Cleary and Hilton (1968) found the interaction of test item with
race statistically significant, though not accounting for an appreci- 
able percentage of the total variation. They plotted difficulty of 
each item for Negroes against difficulty of each item for whites 
and used this bivariate scatterplot to interpret the interaction. 
Though their procedure is tantamount to plotting the item means 
for each of the races separately on the Y-axis, this latter procedure 
tends to be more easily interpretable visually, because it can directly 
reflect the results of the analysis of variance. 

As an illustration, compare their Figure 3 (Figure 1 in this note)
with Figure 2 here, which depicts the interaction more readily
while also preserving the item-difficulty differences, an added nicety,
though not necessary for viewing the interaction itself. Fifty
Cartesian-coordinate points are plotted in Figure 1; 100 race-points are
plotted in Figure 2.

For Figure 2 the scale of the X-axis is the mean difficulty of
each of the 50 items for Negroes and whites combined.² If one plotted
these same values on the Y-axis, the 50 bivariate difficulty points
would form a straight line with 45° positive slope. For each item,
the item difficulty for one race must lie as far above this
mean-of-the-races line as the item difficulty for the other race lies below it.
If there were no interaction, the item difficulties for the races would
form (in the population) two straight lines, each of 45° positive
slope and each parallel to the bisecting 45° line of the overall means.
Even with zero interaction in the population, however, sampling
fluctuations will cause the difficulty points for each race to vary
randomly around a straight line.


Figure 1. Bivariate plot of sums for PSAT math items. Group I sample
size for each race: 318.


¹ I wish to thank Gerry F. Hendrickson for [...] and 3, and Robert Wang
for the final version. T. [...]man, and Marilyn D. Wang made several
valuable [...] earlier draft of the note.

² For all three figures in this note, the difficulty of an item is the sum of
all the points earned by the examinees on it. If an examinee answers a
five-option multiple-choice item correctly, he earns one point. If he marks it
incorrectly, he earns −¼ point. If he omits the item, he "earns" zero points.
Thus, for a few very difficult items the sum of the points earned is negative.

For nonzero interaction, as in the present example, the two
item-difficulty distances will not be constant from item to item. In
Figure 2 they are seen to become smaller at the lower left, indicating
that the performance of Negroes and whites is more alike for the
most difficult items than for the others. This probably occurs because
the hardest items are too difficult and/or ambiguous for examinees
of both races, causing a great deal of guessing or use of
misinformation. The same converging phenomenon would occur at the
upper right (i.e., for the easiest items) if there were items that
virtually all examinees of both races marked correctly.

It is little, if any, more tedious to plot interactions in the manner
of Figure 2 than as in Figure 1, and the gain in "eyeballing"
efficiency can be great, especially when the interaction is more
pronounced than for the data utilized here. For more than two
levels of the plotted factor (say, for three ways of teaching
arithmetic) it may be well to connect the points in order to facilitate
the tracing of fluctuations within the graph.


Figure 2. A plot of the interaction of test items with race for PSAT-M,
Group I, n for each race 318. (X-axis: average number of points earned
on each item. Y-axis: points earned by each race.)

If there are just two sets of points to be plotted, the procedure 
shown in Figure 3 may be preferable. There, differences (white 
minus Negro) are shown for each item, grouped around the hori- 
zontal line representing the mean overall difference favoring the 
whites. Vertical lines above this horizontal base line represent items 
on which the whites excelled more than usual, whereas vertical lines 
below the horizontal line represent items on which the Negroes 
scored better than usual. It is easy to see that the whites scored 
relatively best on the middle-difficulty items. 
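
The coordinates needed for both kinds of plot follow directly from the two sets of item sums. The sketch below is illustrative only: the function name and the three-item data are hypothetical, and it assumes equal group sizes, so that the combined item mean is the simple average of the two group values.

```python
def interaction_coordinates(group_a, group_b):
    """Plotting coordinates for two groups measured on the same items.

    Returns the common X values (item means over both groups), the item
    differences (group_a minus group_b), the mean difference (the
    horizontal base line), and each difference's deviation from it.
    """
    if len(group_a) != len(group_b) or not group_a:
        raise ValueError("both groups need the same, nonzero number of items")
    x = [(a + b) / 2 for a, b in zip(group_a, group_b)]
    diff = [a - b for a, b in zip(group_a, group_b)]
    mean_diff = sum(diff) / len(diff)
    deviation = [d - mean_diff for d in diff]
    return x, diff, mean_diff, deviation


# Hypothetical sums of points earned on three items by two groups.
group_a = [120.0, 80.0, 10.0]
group_b = [100.0, 50.0, 5.0]
x, diff, mean_diff, deviation = interaction_coordinates(group_a, group_b)

# Two-group plot: (x, group_a) and (x, group_b) lie equally far above and
# below the 45-degree line y = x.
for xi, a, b in zip(x, group_a, group_b):
    print(xi, a - xi, xi - b)

# Difference plot: vertical lines run from mean_diff to each diff value.
print(mean_diff, deviation)
```

With no interaction every entry of the deviation list would be zero apart from sampling fluctuation, so the difference plot shows the interaction as departures from a single horizontal line.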

By keeping the item-mean scale for the X-axis of Figure 3, one
can determine visually whether magnitude of the differences seems
related to the overall difficulty of the items, thus overcoming a
principal objection to level-free measures.³


Figure 3. Interaction of race with items, shown relative to
average-difference line. (Points above the horizontal line represent items
especially easier [...] relatively [...].) X-axis: average number of
points earned on each item. Y-axis: difference between points earned by
whites and Negroes.


REFERENCE 
Cleary, T. Anne and Hilton, T. L. An Investigation of Item Bias.
EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT, 1968, 28, 61-75.


³ The procedure can be extended to three or more groups by using as the
horizontal reference line the overall mean of the groups. Then the Y-axis
will indicate deviations from this mean for each group, item by item. The
X-axis will remain as in Figures 2 and 3.


DISPROPORTIONALITY OF CELL FREQUENCIES
IN PSYCHOLOGICAL AND EDUCATIONAL
EXPERIMENTS INVOLVING MULTIPLE
CLASSIFICATION¹


R. KIRK STEINHORST AND C. DEAN MILLER
Colorado State University 


Disproportionate numbers of observations in the subclasses in
cases of multiple classification in studies in education and psychol- 
ogy are a frequent occurrence. Many psychologists and educa- 
tors, when confronted with the problem of disproportionality, ob- 
tain proportional subclass numbers through a random sampling 
procedure. With a fairly large number of observations per cell and 
only slight or moderate deviation from proportionality, the re- 
searcher is able to obtain the necessary analysis, through random 
sampling procedures, to interpret the data. One of the difficulties 
has been an inability to specify what constitutes a fairly large 
number of observations and slight or moderate deviation from 
proportional or equal subclass numbers. With extreme deviations 
resulting from disproportionality, the researcher is confronted 
with a situation in which alternatives are not readily available. The 
authors present here a systematic formulation of general methods 
of attacking the problem.

Tsao (1946) presented a brief historical review of the discus- 
sions related to the problem of disproportionate subclass numbers. 
In one of the early published reports, Snedecor and Cox (1935) 
introduced the method of expected subclass numbers along with a 
discussion of the (a) method of fitting constants, (b) method of 
weighted squares of means, and (c) method of unweighted means. 
Nonmathematical readers may have considerable difficulty apply- 
ing the solutions which were reviewed by Tsao and proposed by 
Tsao (1946), Snedecor and Cox (1935), Brandt (1933), and Yates 
(1934), and be unable to justify the use of approximate methods
while meeting requirements of a valid analysis. This paper is
designed to enable the reader to better understand the applications
of and the relationships between approximate methods and exact
methods for analysis of variance of multiple classifications.


¹ This research was supported in part under National Institutes of Health
Training Grant Number 5 T 1 GM 725.


Analyzing the Problem 


There are several topics of prime importance to the experimenter
who is faced with analyzing experimental data which are
disproportionate due to subject loss, subject nonresponse, or any
other of the various problems resulting in disproportionality which
plague the psychological or educational experimenter. Heretofore
there has been a lack of emphasis on the topics of experimental
model which has been assumed, interaction assumptions, definition
of effects, and validity of approximations for the analysis of
disproportionate data in tables of multiple classification. These are
the items which will be discussed along with the fundamental
question of whether the experimenter should or should not analyze
disproportionate data in the first place.

In response to the immediately preceding question, the authors
would suggest that with the linear model theory which has been
developed to date one can readily analyze disproportionate data
with the same theory as one would treat proportionate or equal
frequency data (see Graybill, 1961). The results would be exact
and adequate for the experimental analysis. Hence, seemingly one
need not force proportionality or resort to approximate procedures.
However, this general theory of linear models is not often
available to the nonstatistician and thus one is required to consider
forcing proportionality or applying approximate procedures. Even
if the experimenter were well versed in the proper theory, the
computation may prove prohibitive without access to a large
computer, or, if not prohibitive, at least the experimenter may judge an
approximate procedure as sufficient for the analysis. Certainly, if
one has an experiment dealing with imprecise responses [...]
rationale we see that approximations may serve educators and
psychologists in many cases.
After accepting the appropriateness of inexact procedures, the
experimenter must choose between forcing proportionality,
applying an approximate procedure, or redesigning the experiment.
Certainly, if extreme disproportionality exists, redesigning or
replicating the experiment may be the only choice. It may not be
advisable to butcher data by adding or deleting items to force 
proportionality. Thus it becomes a subjective judgment as to 
whether or not the data are sufficiently disproportionate to warrant 
use of an approximate procedure. If no approximate procedure is 
found to be applicable, the experimenter is back to the problem of 
redesigning or replicating the experiment. 

Not enough attention has been given to the experimental model 
with which one purports to be dealing in any particular experi- 
ment. Certainly the analysis depends on whether one assumes all 
fixed effects, all random effects, or mixed effects. This is especially 
true for the approximate procedures. For a discussion of these 
concepts see Hays (1963). This topic is mentioned here in hopes
that experimenters will not simply form analysis of variance tables 
and consequently tests of significance (in the form of F tests) with- 
out first checking to see which F tests are proper (see Ostle, 1963). 
As a matter of fact, it is of more than passing interest to note 
that F tests are often insufficient for analysis and more accurate 
interpretations of the experiment may be had through the use of 
confidence intervals. 

It is quite common to see an analysis of variance performed with- 
out any regard to the assumptions of interaction. The reader is very 
likely familiar with the traditional computational formulas reported 
in the literature. However, it should be noted that the analyses and 
interpretation differ between the cases where interaction of the 
factors is assumed and where interaction is not assumed to be
present. Again, this is particularly true in approximate procedures.
The rule here is to decide before the experiment if one should or 
should not test for interaction. If the answer is yes, this analysis 
must be done first and then further analysis may be performed in 
light of these results. Obviously if there is no interaction, the sums 
of squares must be partitioned among the main effects only. If 
there is interaction, then one must further define what is interesting 
in the experiment. To correctly apply an approximate solution
the experimenter must consider the assumptions of interaction. 

Main effects and interaction effects may be defined mathematically
and empirically and must be considered in using approximate
as well as exact procedures. Effects may be defined as some
measure of the contribution or significance of treatment factors
and factor interaction. For a discussion of the main and interaction
effects, see Hays (1963). The interesting fact here is that effects
may be defined in several ways. Hays gives the traditional defini- 
tions and approach. But, as was noticed in the preceding section, 
the traditional definitions of main effects may not give interesting 
analysis in cases where interaction is present. For instance, one 
could average the main effects for a factor over different levels 
of the other factors, or one could consider the effects as different 
at different levels, knowing that the levels of that factor may op- 
erate differently at different levels of the other factors due to the 
interaction mechanism. One should be aware of what the effects 
that one’s analysis of variance is testing represent to the experiment 
and to the choice of factors and levels in designing the experiment 
(see Finney, 1948). 

The original and early reports of the use of approximate proce- 
dures contained largely mathematical solutions and procedures. It 
should be realized that the work of Brandt, Yates, Snedecor and 
Cox, and Tsao, concerning approximations to analysis of variance 
for disproportionate data was done in the 1930's and 1940's. There 
has been much statistical theory developed since then and, as
indicated above, exact procedures have been developed which may
be used. Also the questions concerning models, interaction, and
effects were not deemed as vital as one realizes that they are today.
Thus we will want to discuss approximate procedures in light of
the above discussions.

The approximate methods most commonly used are set forth
in the paper by Snedecor and Cox (1935). The reader should check
this source or a similar one (Anderson and Bancroft, 1952; Brandt,
1933; Yates, 1934) for computational procedures and examples.

[...] not the usual F statistics and which have different power functions
but are, in fact, truly distributed as F under the null hypothesis.
Furthermore, the tests reduce to the normal tests if the frequencies
are not disproportionate (see below on validity). If the method
gives a test for interaction it is only approximate. In light of this,
it is convenient to classify the method of weighted means as
approximate. A minimal explanation of the methods of unweighted
means and weighted means appears below for the purpose of
illustrating the rationale for the use of these nonexact solutions.

Now let us consider a hypothetical experiment which has mul- 
tiple classifications and which has disproportionate cell frequencies. 
By this consideration, we shall indicate the method of attacking 
such problems in general. 

Suppose our data are as they appear in Table 1. Obviously, we have
sufficient disproportionality to justify an approximate procedure. 
If we had found that the data were only slightly disproportionate, 
we would have forced proportionality and followed an analysis 
such as is found in Snedecor (1946). 

One should be aware of the fact that in the proportional analysis 
(as opposed to the equal frequencies analysis) one is testing a com- 
bination of effects hypothesis in the analysis of variance and not 
the simple traditional ones. Hence, the experimental writeup and 
analysis should reflect this. One cannot simply say, “The interaction 


TABLE 1 
Data Arranged in a 3 × 2 Table with Disproportionate Cell Frequencies


E 
E 


3.06 4.80 4.29 3.09 4.59 4.59 3.75 4.05 3.63 3.74 3.52 3.04 3.50 

5.38 3.71 4.13 3.32 3.80 3.89 5.22 3.82 2.74 4.76 3.40 4.00 3.64 
p 291 3.63 5.60 4.12 4.39 4.62 4.49 3.89 4.17 9,85 9.99 4,80 
| 474 3.28 5.92 5.30 4.24 3.17 4.39 

pied 3.10 4.96 4.53 5.21 3.82 4.21 

. 4.75 3.54 

eens X-4238 n= 17 X = 3.800 
E38 0 IU IRRUI UNDE Lc 

4.00 2.61 3.04 3.24 3.90 5.08 4.19 4.42 4.64 4.28 3.60 4.48 4.72 
p 379 4.41 4.00 4,08 4.46 4.40 4.10 4.10 4.58 
L' 806 452 3.68 4.32 4.03 4.07 4.04 

3.80 5.35 3.79 4.34 4.82 5.63 4.45 

EE M X = 4.348 
REI £-415 n=8 24. 
» $32 3.90 450 4.04 494 0.30 3.29 308 420 4.07 4.87 

¥ 4. ? 

wa ig 9 $75 98 OM Pade Nd X = 4.058 


One generally determines the number of observations required in each
cell that will yield proportionality with the minimum amount of change
of data in the experiment. Then data are randomly deleted or added in
each cell to agree with these numbers. By randomly deleted or added, we
mean, for instance, that one numbers the cell observations, chooses a
number from a table of random numbers which falls in the range of the
numbers, then either omits that observation or duplicates it as it must
be deleted or added.
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was (or was not) significant," since this measure contains contami- 
nation from main effects. 

Researchers without strong mathematical backgrounds may find 
it easier to follow a discussion of the use of the unweighted and 
weighted means procedures with an example showing the analysis 
of the data in Table 1. We shall first assume a fixed effects model 
with interaction, since we cannot say that we have no interaction 
of factors operating. The unweighted means analysis appears in 
Table 2. The weighted means analysis appears in Table 3. The 
authors have found these two methods sufficient for their needs. If 
the reader does not find this to be true, he should check the paper 
by Snedecor and Cox (1935) to find an applicable technique. 


TABLE 2
Unweighted Means Analysis of Variance

Source           df     Sums of Squares    Mean Square    F
Total            110    1980.552
Mean             1      1926.481
Treatments       5      5.357
  A              1      0.103              0.103          2.34
  B              2      0.089              0.045          1.02
  A X B          2      0.185              0.093          2.11
Within           104    48.714
Within (avg.)    104    4.628              0.044

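$$1980.552 = \sum_A \sum_B \sum_k Y_{abk}^2 = (3.06)^2 + (4.80)^2 + \cdots + (4.87)^2$$

$$1926.481 = \Bigl(\sum_A \sum_B \sum_k Y_{abk}\Bigr)^2 / 110 = \tfrac{1}{110}\,(3.06 + 4.80 + \cdots + 4.87)^2$$

$$5.357 = \sum_A \sum_B \Bigl[\Bigl(\sum_k Y_{abk}\Bigr)^2 / n_{ab}\Bigr] - 1926.481
       = \frac{(161.42)^2}{38} + \frac{(64.60)^2}{17} + \cdots - 1926.481$$

$$0.103 = \sum_A \Bigl(\sum_B \bar{Y}_{ab\cdot}\Bigr)^2 / b - \Bigl(\sum_A \sum_B \bar{Y}_{ab\cdot}\Bigr)^2 / ab$$
$$\phantom{0.103} = \frac{(4.248 + 4.115 + 4.603)^2}{3} + \frac{(3.8 + 4.348 + 4.058)^2}{3} - \frac{(4.248 + 4.115 + \cdots + 4.058)^2}{6}$$

$$0.089 = \sum_B \Bigl(\sum_A \bar{Y}_{ab\cdot}\Bigr)^2 / a - \Bigl(\sum_A \sum_B \bar{Y}_{ab\cdot}\Bigr)^2 / ab$$
$$\phantom{0.089} = \frac{(4.248 + 3.8)^2}{2} + \frac{(4.115 + 4.348)^2}{2} + \frac{(4.603 + 4.058)^2}{2} - \frac{(4.248 + 4.115 + \cdots + 4.058)^2}{6}$$

$$0.185 = \sum_A \sum_B \bar{Y}_{ab\cdot}^2 - 0.103 - 0.089 - \Bigl(\sum_A \sum_B \bar{Y}_{ab\cdot}\Bigr)^2 / ab$$
$$\phantom{0.185} = (4.248)^2 + (3.8)^2 + \cdots + (4.058)^2 - 0.103 - 0.089 - \frac{(4.248 + 4.115 + \cdots + 4.058)^2}{6}$$

$$48.714 = 1980.552 - 1926.481 - 5.357$$

$$4.628 = 48.714 / \bar{n}_h$$

where

$$\bar{n}_h = ab \Big/ \sum_A \sum_B (1/n_{ab}) = 6 \Big/ \sum_A \sum_B (1/n_{ab})$$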
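The arithmetic behind Table 2 can be sketched in a few lines of Python. This is a minimal illustration and not the authors' program; the function names are hypothetical, and the interaction term is taken as the between-cells sum of squares of the cell means minus the two main effects. The cell frequencies passed to `harmonic_mean` are those of the worked example.

```python
def harmonic_mean(cell_ns):
    """Harmonic mean of the cell frequencies: ab / sum(1 / n_ab)."""
    return len(cell_ns) / sum(1.0 / n for n in cell_ns)


def unweighted_means_ss(means):
    """Sums of squares computed on cell means.

    means[i][j] is the mean for level i of A and level j of B.
    Returns (SS_A, SS_B, SS_AB) on the scale of the cell means.
    """
    a, b = len(means), len(means[0])
    cf = sum(sum(row) for row in means) ** 2 / (a * b)  # adjusted correction
    ss_a = sum(sum(row) ** 2 for row in means) / b - cf
    col_sums = [sum(means[i][j] for i in range(a)) for j in range(b)]
    ss_b = sum(s ** 2 for s in col_sums) / a - cf
    ss_cells = sum(m ** 2 for row in means for m in row) - cf
    return ss_a, ss_b, ss_cells - ss_a - ss_b


# Cell frequencies of the 3 X 2 example.
n_h = harmonic_mean([38, 30, 13, 17, 8, 4])
print(round(n_h, 3))
```

The within sum of squares is then divided by `n_h` so that it is on the same scale as the sums of squares computed from the cell means.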


The assumptions for unweighted means analysis are as follows:
(a) no cell is empty, i.e., no $n_{ab}$ = 0, (b) for preliminary analysis,
especially when not much is known about the population, (c) the
cell frequencies do not vary greatly from equality, (d) primary
interest is whether interaction is or is not present, (e) one wishes
to test main effects when interaction is negligible, and (f) exact
solutions are prohibitive or not available, and the study or
experiment does not warrant an exact solution.

It is readily apparent that judgment is a key factor in decisions to 
use an unweighted means solution. In the example, the cell fre- 
quencies do vary considerably and in several studies reviewed by 
the authors disproportionality to this extent was obtained. Assump- 
tions as to interaction were important in the analyses of the data 
reported in Table 1. In the opinion of the authors the nature of the
study (counselor assignment and client attitudes toward counseling)
did not warrant exact solutions. The studies which were reviewed
were exploratory studies, and the information obtained through a
replication of the first study substantiated the data reported earlier
(Gabbert, Ivey, and Miller, 1967). The unweighted means solution made
it possible to analyze the data without loss of a large amount of data,
and replication of the study was necessary in view of the assumptions
and use of an approximate solution.

The assumptions and limitations for the weighted means analysis
are as follows: (a) no cell is empty, i.e., no $n_{ab}$ = 0, (b) two
classifications and only two levels for one of the classifications, (c)


TABLE 3
Weighted Means Analysis of Variance

Source           df     Sums of Squares    Mean Square    F
Total            110    1980.552
Mean             1      1926.481
Treatments       5      5.357
  A              1      1.080              1.080          2.31
  B              2      1.164              0.582          1.24
  A X B          2      0.314              0.157          0.335
Within           104    48.714             0.469


All calculations are identical to the ones in Table 2 except the A, B, and A X B
entries. They are derived through the following format.


Level   Entry   A1            A2       Sum 1/n   UWM     D       W        WM       WD
B1      n       38            17
        mean    [illegible]   3.800              4.037   0.474   11.760   47.475   5.574
        1/n     1/38          1/17     0.0851
B2      n       30            8
        mean    4.115         4.348              4.232   0.233   6.329    26.784   1.454
        1/n     0.0333        0.125    0.1583
B3      n       13            4
        mean    4.603         4.058              4.330   0.545   3.058    13.241   1.667
        1/n     0.0769        0.250    0.3269
Sum                                                              21.147   87.500   8.695

        Entry     A1       A2       Sum
        UWM'      4.331    4.069
        Sum 1/n   0.1365   0.4338
        W'        7.299    2.304    9.603
        WM'       31.612   9.375    40.987

$$1.080 = 9\Bigl[\sum (UWM')(WM') - \frac{(40.987)^2}{9.603}\Bigr]$$

$$1.164 = 4\Bigl[\sum (UWM)(WM) - \frac{(87.5)^2}{21.147}\Bigr]$$

$$0.314 = \sum (D)(WD) - \frac{(8.695)^2}{21.147}$$


Note.—The format follows the example reported by George W. Snedecor and Gertrude M.
Cox, Disproportionate Subclass Numbers in Tables of Multiple Classification, Research
Bulletin No. 180, March 1935, Ames, Iowa.

one wants a measure of main effects when interaction is not negligible,
(d) inequality of cell frequencies is not great, and (e) exact
solutions are prohibitive or not available and the study does not
warrant an exact solution.
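The interaction entry of the weighted means worksheet can be reproduced approximately with the sketch below. It assumes the two-level layout of the example; the function name is hypothetical, and the inputs are the D column and the cell frequencies as they appear in the worksheet. Because the weights are computed here without intermediate rounding, the result is close to, but not identical with, the hand-computed 0.314.

```python
def weighted_interaction_ss(diffs, n1, n2):
    """Weighted interaction sum of squares when A has two levels.

    diffs[j] is the difference between the two A-level means at level j of B;
    n1[j] and n2[j] are the corresponding cell frequencies.
    """
    # Weight for each level of B: reciprocal of (1/n1 + 1/n2).
    weights = [1.0 / (1.0 / p + 1.0 / q) for p, q in zip(n1, n2)]
    sum_wd = sum(w * d for w, d in zip(weights, diffs))
    sum_wd2 = sum(w * d * d for w, d in zip(weights, diffs))
    return sum_wd2 - sum_wd ** 2 / sum(weights)


ss_ab = weighted_interaction_ss([0.474, 0.233, 0.545], [38, 30, 13], [17, 8, 4])
print(round(ss_ab, 3))
```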

We find that interaction is insignificant in either analysis. Thus 
we accept the hypothesis that the interaction effect is zero. By con- 
sidering the expected mean square for interaction we see that we 
have merely another estimate of residual variance. It may be pooled 
with the within mean square to test the main effects and the analysis 
would be similar from this point to the case where interaction is 
originally assumed to be absent. Or, as is naively done most often, 
one may merely test the main effects with the within mean 
square. This is justifiable if the degrees of freedom for the within 
mean square is sufficiently large. The analyses in Tables 2 and
3 were handled in this way to enable the reader to relate the 
discussion to an actual study. 
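The pooling alternative mentioned above can be illustrated with the Table 3 entries. This is only a sketch with a hypothetical helper name; Tables 2 and 3 themselves test the main effects against the within mean square rather than against a pooled term.

```python
def pooled_error(ss_within, df_within, ss_inter, df_inter):
    """Pool a nonsignificant interaction with the within term.

    Returns the pooled mean square and its degrees of freedom.
    """
    df_pooled = df_within + df_inter
    return (ss_within + ss_inter) / df_pooled, df_pooled


# Within and A X B entries from Table 3.
ms_pooled, df_pooled = pooled_error(48.714, 104, 0.314, 2)
f_a = 1.080 / ms_pooled  # F ratio for factor A against the pooled error
print(round(ms_pooled, 4), df_pooled, round(f_a, 2))
```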

If the interaction had been significant, we would not have tested 
the main effects but would have considered the simple effects in a
manner analogous to the discussion reported in Winer (1962).
The above analyses are justified by checking to see that the
assumptions underlying the methods of weighted and unweighted
means are satisfied.

Since our experiment had only two factors, we find that the
fixed effects computations and tests apply for the analysis of the
data if we had assumed a random or mixed effects model. Of course,
as Hays (1963) indicates, the interpretations are quite different in
the case of fixed effects, random effects and mixed effects models.
If this were generally true, there would have been no need to
worry about model assumptions. But, in the three and higher-way
classification experiments, one finds that the three models vary as
to what mean square ratios should be used for proper tests of
significance (see Ostle, 1963).

From such considerations one notices several things.

1. The weighted means method is not applicable beyond the
two factor situation.
2. As long as no empty cells appear, the method of unweighted
means is most generally usable and offers an analogy
to what the experimenter is familiar with in the equal or
proportional frequency case.


808 EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 


3. If the unweighted means method is not applicable or precise
enough for the experimenter, it may be well to break the 
three or more factor experiments into a series of two factor 
experiments (see Federer, 1963). 


Validity of Approximate Procedures 


We shall now discuss the computational formulas for the equal
frequency case, the unweighted means case and the weighted
means case in order to justify the two approximate procedures and
establish relationships with the equal frequency case. Without loss
of generality it will be convenient to consider again the two factor
experiment with interaction where there are a levels of factor A,
b levels of factor B, and $n_{ab}$ observations in cell AB.

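(1) Equal frequency formulae ($n_{ab} = n$ for all cells).

$$\text{C.F.} = \text{correction for the mean} = \Bigl(\sum_A \sum_B \sum_k Y_{abk}\Bigr)^2 / abn$$

$$SS_A = \sum_A \Bigl(\sum_B \sum_k Y_{abk}\Bigr)^2 / bn - \text{C.F.}$$

$$SS_B = \sum_B \Bigl(\sum_A \sum_k Y_{abk}\Bigr)^2 / an - \text{C.F.}$$

$$SS_{AB} = \sum_A \sum_B \Bigl(\sum_k Y_{abk}\Bigr)^2 / n - SS_A - SS_B - \text{C.F.}$$

(2) Unweighted means formulae.

$$\text{C.F.}' = \text{adjusted correction for the mean}
  = \Bigl(\sum_A \sum_B \bar{Y}_{ab\cdot}\Bigr)^2 / ab
  = \Bigl[\sum_A \sum_B \Bigl(\sum_k Y_{abk} / n_{ab}\Bigr)\Bigr]^2 / ab$$

$$SS_A = \sum_A \Bigl(\sum_B \bar{Y}_{ab\cdot}\Bigr)^2 / b - \text{C.F.}'
      = \sum_A \Bigl[\sum_B \Bigl(\sum_k Y_{abk} / n_{ab}\Bigr)\Bigr]^2 / b - \text{C.F.}'$$

$$SS_B = \sum_B \Bigl(\sum_A \bar{Y}_{ab\cdot}\Bigr)^2 / a - \text{C.F.}'
      = \sum_B \Bigl[\sum_A \Bigl(\sum_k Y_{abk} / n_{ab}\Bigr)\Bigr]^2 / a - \text{C.F.}'$$

$$SS_{AB} = \sum_A \sum_B \bar{Y}_{ab\cdot}^2 - SS_A - SS_B - \text{C.F.}'
         = \sum_A \sum_B \Bigl(\sum_k Y_{abk} / n_{ab}\Bigr)^2 - SS_A - SS_B - \text{C.F.}'$$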
If all the $n_{ab}$ were equal, the formulas would reduce to the equal
frequency case in (1) except that an $n^2$ instead of an n would appear
in the denominators of the sums. Hence, one divides the within
sums of squares by the harmonic mean to find an “average”
residual sums of squares which is appropriate for forming the
proper F ratios. The harmonic mean has terms of $1/n_{ab}$ in its
denominator and thus accounts for the extra power of n noted above.
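The extra power of n can be checked numerically. The sketch below uses a small hypothetical balanced data set (two observations per cell) and hypothetical function names; it computes the factor A sum of squares by the equal frequency formula and by the unweighted means formula and confirms that the two differ by exactly the factor n.

```python
def ss_a_equal_frequency(cells):
    """SS for factor A from raw observations, equal n per cell.

    cells[i][j] is the list of observations for level i of A, level j of B.
    """
    a, b, n = len(cells), len(cells[0]), len(cells[0][0])
    grand = sum(y for row in cells for cell in row for y in cell)
    cf = grand ** 2 / (a * b * n)
    row_totals = [sum(y for cell in row for y in cell) for row in cells]
    return sum(t ** 2 for t in row_totals) / (b * n) - cf


def ss_a_unweighted_means(cells):
    """SS for factor A computed on the cell means."""
    a, b = len(cells), len(cells[0])
    means = [[sum(cell) / len(cell) for cell in row] for row in cells]
    cf = sum(m for row in means for m in row) ** 2 / (a * b)
    return sum(sum(row) ** 2 for row in means) / b - cf


# Hypothetical balanced 2 X 2 layout with n = 2 per cell.
cells = [[[1, 3], [2, 4]], [[5, 7], [6, 10]]]
n = 2
print(ss_a_equal_frequency(cells), n * ss_a_unweighted_means(cells))
```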

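(3) Weighted means formulae (2 levels of A only). The
correction for the mean will vary as indicated.

$$SS_A = \sum_A \frac{\bigl(\sum_B \bar{Y}_{ab\cdot}\bigr)^2}{\sum_B 1/n_{ab}}
       - \frac{\Bigl[\sum_A \Bigl(\sum_B \bar{Y}_{ab\cdot} \big/ \sum_B 1/n_{ab}\Bigr)\Bigr]^2}{\sum_A \bigl(1 \big/ \sum_B 1/n_{ab}\bigr)}$$

$$SS_B = \sum_B \frac{\bigl(\sum_A \bar{Y}_{ab\cdot}\bigr)^2}{\sum_A 1/n_{ab}}
       - \frac{\Bigl[\sum_B \Bigl(\sum_A \bar{Y}_{ab\cdot} \big/ \sum_A 1/n_{ab}\Bigr)\Bigr]^2}{\sum_B \bigl(1 \big/ \sum_A 1/n_{ab}\bigr)}$$

$$SS_{AB} = \sum_B \frac{\bigl(\bar{Y}_{1b\cdot} - \bar{Y}_{2b\cdot}\bigr)^2}{\sum_A 1/n_{ab}}
         - \frac{\Bigl[\sum_B \Bigl(\bigl(\bar{Y}_{1b\cdot} - \bar{Y}_{2b\cdot}\bigr) \big/ \sum_A 1/n_{ab}\Bigr)\Bigr]^2}{\sum_B \bigl(1 \big/ \sum_A 1/n_{ab}\bigr)}$$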

It is not immediately obvious that, if the $n_{ab}$'s were equal, the
formulas above would reduce directly to the formulas in (1). However,
this is certainly the case and can be verified by substituting a
constant n for each $n_{ab}$ and performing the summations involved.
Of course in each case

$$\bar{Y}_{ab\cdot} = \sum_k Y_{abk} / n_{ab}$$

becomes, in the equal $n_{ab}$'s situation,

$$\bar{Y}_{ab\cdot} = \frac{1}{n} \sum_k Y_{abk}.$$


We have thus established that these approximate methods are 
valid generalizations of the equal frequency method in the sense 
that, if the frequencies were equal, both approximate methods are 
equivalent to the exact one. This is in accordance with many 
mathematical theories, for instance, the definition of $e^z$ where z
is a complex number reduces to the real definition when z is a real 
number. One is not assured, however, that these procedures have 
other properties which statistical tests should possess. 


Conclusion 


When faced with an experiment with disproportionate cell
frequencies one must

A. Justify redesign or replication of experiment, forcing of pro- 
portionality, or using an exact or approximate technique on the 
data which are available. 

B. If an approximate technique is required, choose a technique 
which is appropriate to (a) the accuracy of the data, (b) the 
assumptions needed, and (c) the model assumed.

C. State whether interaction may be working before the analysis 
is performed. 
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DEVELOPMENT OF MODERATED SCORING 
KEYS FOR PSYCHOLOGICAL INVENTORIES* 


DALE J. PREDIGER? 
University. of Toledo 


DUNNETTE (1963) and Ghiselli (1963) have reviewed research
on the application of moderator variables to prediction problems. 
Conger (1967) presented mathematical and theoretical bases for 
explaining moderator variable operation. Two general approaches 
to the use of moderator variables have generally been cited. The 
first approach involves the development, by means of empirical 
keying techniques, of a scale that differentiates between individuals 
for whom inventory or test predictions differ in accuracy. In a
variation of this approach, a scale is developed to identify which of
two predictors is more accurate for a given individual. The second
approach involves the use of a predetermined moderator variable
rather than one that has been empirically derived. The relation- 
ship between predictor and criterion at different intervals along 
this moderator variable is calculated, thus allowing identification of 
subgroups with differing degrees of predictability. In an elabora- 
tion of this approach, Rock, Barone, and Linn (1967) used mul- 
tiple moderators to identify that set of sample subgroups which 
optimizes predictability obtained from multiple predictors. 

The quadrant-analysis technique reported by Hobert and Dunnette
(1967) involves elements of both approaches. In many respects
it is similar to the moderated scoring key technique illustrated
by Prediger (1966). In this third approach to the use of
moderator variables, Prediger demonstrated that academic ability
is an effective moderator variable in the prediction of college
dropout from biographical data (biodata). Separate scoring keys were
developed for a biographical inventory at each of three ability
levels. For two of the ability groups, the contribution of biodata to
the differentiation of persisters and dropouts in a cross-validation
sample was found to be greater than it was for the total group. For
the third ability level, the results were mixed. Overall differentiation
of persisters and dropouts was greater when separate ability
level keys (e.g., moderated keys) were constructed than when one
overall key was used. Hence, a technique for increasing the overall
predictive validity of an inventory by the development of moderated
scoring keys was demonstrated. This approach also allows for
the identification of subgroups that have greater or lesser degrees of
predictability, which is as far as the previous approaches developed
by Ghiselli and others have gone.

The primary objective of the present study involved the explora- 
tion and refinement of the techniques for developing moderated 
scoring keys. Investigation centered in two areas: 

1. Development of procedures for determining the optimum 
number of subgroups (and hence, moderated scoring keys) required
for maximizing the predictive effectiveness of an inventory.

2. Development of a single scale for reporting the scores ob- 
tained from a set of moderated scoring keys, thus facilitating the 
practical application of moderated keys to prediction problems. 

The secondary objective involved tryout of the moderated scoring
key techniques in a practical situation, i.e., the prediction of
college attendance from biodata. General and subsidiary hypothe- 
ses were formulated as follows: 

1. General hypothesis: The contribution of biodata to the pre- 
diction of college attendance is greater, upon cross-validation, 
for scoring keys developed within ability level subgroups (e.g.,
moderated keys) than for a key developed on the total group (e.g.,
the general key).

2. Subsidiary hypotheses: Moderated scoring key predictions of | 
college attendance-nonattendance are more accurate than predic- 
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tions based on scores resulting from: (S1) the normal empirical 
keying procedures used in forming the general key; (S2) the above, 
plus academic aptitude scores in optimally weighted combination; 
and (S3), academic aptitude scores used alone.

In order for the general hypothesis to be supported, it is necessary
to confirm only subsidiary hypothesis S1. However, S2 and
S3 are directly relevant to the practical application of moderated
keys since these keys must produce more accurate predictions 
than could be obtained through other readily available means. In 
other words, there is no substitute for incremental validity. 


Method 


Sample 


Subjects constitute a subgroup of the stratified random sample 
of the U. S. high school seniors participating in the data collection 
phase of Project TALENT in the spring of 1960. The nature of 
the stratification variables and the representativeness of the ob- 
tained sample are discussed in a publication entitled “The Project 
TALENT Data Bank” (1965). Only males who responded to the 
Project TALENT follow-up questionnaire mailed in the summer 
of 1961 were included in the study. Of the 21,534 cases supplied 
by the data bank, 20,534 were available after screening for in- 
complete data. For purposes of analyses, these cases were randomly 
divided into four independent subgroups. 


Variables 


In order to form the moderator variable, reading comprehension
and mathematics scores (data bank variables 250 and 320,
respectively) were combined by use of equally weighted
standard scores. Both measures have been shown to have a substantial
relationship with the criterion to be predicted (Flanagan,
Davis, Dailey, Shaycoft, Orr, Goldberg, and Neyman, 1964).
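One way to read "equally weighted standard scores" is as a sum of z scores. The sketch below is an assumption about that composite, not the Project TALENT procedure; the function names and the three-student data are hypothetical, and the population standard deviation is used for standardizing.

```python
from statistics import mean, pstdev


def z_scores(xs):
    """Standardize a list of raw scores to mean 0 and unit variance."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]


def moderator_scores(reading, mathematics):
    """Equally weighted composite: sum of the two standardized scores."""
    return [r + m for r, m in zip(z_scores(reading), z_scores(mathematics))]


# Hypothetical raw scores for three students.
composite = moderator_scores([10, 20, 30], [1, 2, 3])
print([round(c, 3) for c in composite])
```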

The Student Information Blank (SIB), a Project TALENT biographical
inventory, supplied the response pool from which scoring
keys were developed. Of the 394 items on the SIB, [number illegible] were
eliminated because they did not appear to fit the common definition
of biodata. These items chiefly covered study habits [illegible] of
various kinds. Approximately 1,650 response options were available
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for item analyses. Although omits were treated as response op- 
tions in the item analyses, they were not included on any scoring 
keys. 

Student responses to item two of the 1961 follow-up question- 
naire (Flanagan et al., 1964) served to operationally define the cri- 
terion, college attendance. Those students indicating that they had 
entered college as full-time students were included in the college 
attendance group. All others were placed in the nonattendance 
group. 


Design and Analyses 


Twelve ability level subgroups as nearly equal in size as possible 
were chosen as being appropriate to the purpose of the study and 
the data available. The same score limits were used to identify 
these subgroups in each of the four subsamples. By use of the 
12 subgroups and by combining data from adjacent groups, scor- 
ing keys could be formed for 12, 6, and 3 subgroups plus the 
total group. Thus it was possible to empirically determine the 
optimum number of subgroups and, hence, moderated scoring 
keys required for maximizing the predictive effectiveness of the 
response pool. 

Since moderated scoring keys were formed at each of the twelve 
ability levels and combinations thereof, 21 moderated keys were 
developed. These plus the general key made a total of 22 keys. 
Four keys were appropriate to each of the 12 ability levels: the 
general key, the key specific to the ability level, the first order 
combined key formed with data from two adjacent levels, and the 
second order combined key formed with data from adjacent first 
order combined keys. The keys appropriate to each of the 12 
ability levels are shown in Table 1. As can be seen, key 22, the 
general key, was formed and scored on all students. The usual 
empirical keying techniques would have involved only this key.
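The relation between ability levels and keys can be written compactly. The sketch below assumes a numbering inferred from the text (keys 1-12 for single ability levels, 13-18 for pairs of adjacent levels, 19-21 for sets of four levels, and 22 for the general key); the function name is hypothetical.

```python
def keys_for_level(level):
    """Return the four keys scored for one of the 12 ability levels:
    (ability level key, first order key, second order key, general key)."""
    if not 1 <= level <= 12:
        raise ValueError("ability level must be between 1 and 12")
    first_order = 12 + (level + 1) // 2   # levels 1-2 -> 13, ..., 11-12 -> 18
    second_order = 18 + (level + 3) // 4  # levels 1-4 -> 19, 5-8 -> 20, 9-12 -> 21
    return (level, first_order, second_order, 22)


all_keys = {k for lvl in range(1, 13) for k in keys_for_level(lvl)}
print(keys_for_level(9), len(all_keys))
```

With this numbering, ability level 9 is scored on keys 9, 17, 21, and 22, and the twelve levels together use 22 distinct keys.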

Group 1 analyses. Group 1 was used to obtain the item analysis 
data from which all scoring keys were developed. The records of 
10,183 students were randomly allocated for this purpose. For each 
response to an SIB item, a two-by-two contingency table was 
formed using marked-not marked and attend-nonattend dichoto- 
mies. Phi coefficients calculated from this data formed the basic 
item analysis statistics. 
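The phi coefficient for one response option follows directly from its two-by-two table. A minimal sketch, with a hypothetical function name and argument order:

```python
from math import sqrt


def phi_coefficient(a, b, c, d):
    """Phi for a 2 x 2 table of counts.

    a = marked and attend,      b = marked and nonattend,
    c = not marked and attend,  d = not marked and nonattend.
    """
    denom = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    # An empty row or column leaves phi undefined; 0.0 is returned by convention here.
    return (a * d - b * c) / denom if denom else 0.0


# Hypothetical counts: every marker attends, every non-marker does not.
print(phi_coefficient(10, 0, 0, 10))
```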
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TABLE 1 
Keys Appropriate to Each of the Twelve Ability Levels

                      Moderated scoring keys
Ability    Ability       1st order       2nd order
level      level keys    combination     combination
1          1             13              19
2          2             13              19
3          3             14              19
4          4             14              19
5          5             15              20
6          6             15              20
7          7             16              20
8          8             16              20
9          9             17              21
10         10            17              21
11         11            18              21
12         12            18              21

Note.—Ability level 1 represents the lowest ability group. Key 22, the general key, was
scored at each ability level.


Before keys could actually be formed, a procedure for determin- 
ing optimum key length had to be developed. The approach chosen 
involved the use of five phi values as cutoffs for keys of five differ- 
ent lengths. These phi values were .03, .05, .07, .10, and .15. Thus 
the longest subkey at a given ability level consisted of item
responses with phi coefficients of .03 or higher. The shortest subkey
used .15 as the cutoff. Item responses achieving these cutoffs were 
scored +1 or —1 according to the direction of relationship with 
the criterion. In this manner, a total of 110 subkeys were formed, 
5 for each of the 22 keys. 
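The subkey construction described above can be sketched as follows. The function names and option identifiers are hypothetical, and one assumption is made explicit: the cutoff is applied to the absolute value of phi, since responses are scored in either direction.

```python
CUTOFFS = (0.03, 0.05, 0.07, 0.10, 0.15)


def build_subkeys(phis):
    """Form one subkey per cutoff.

    phis maps a response-option id to its phi coefficient. Each subkey maps
    the retained options to +1 or -1 according to the sign of phi.
    Assumption: the cutoff is compared with the absolute value of phi.
    """
    return {
        cutoff: {opt: (1 if p > 0 else -1)
                 for opt, p in phis.items() if abs(p) >= cutoff}
        for cutoff in CUTOFFS
    }


def score(marked_options, subkey):
    """Score a student: sum of the weights of the options he marked."""
    return sum(subkey.get(opt, 0) for opt in marked_options)


# Hypothetical item analysis results for three response options.
subkeys = build_subkeys({"q1_a": 0.16, "q1_b": -0.06, "q2_a": 0.02})
print(subkeys[0.03], subkeys[0.15])
```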

Obviously, with the volume of item analysis data involved, some 
mechanical means of forming keys was required. Hence, no at- 
tempt to identify logical or illogical response validity data was 
made. Instead, a computer program was written to form subkeys 
and was applied in the same manner to item analysis data for each 
of the 22 keys.

Group 2 analyses. The records of 3,387 students were randomly
assigned to this group. Each student was placed in one of
the 12 ability level groups on the basis of his moderator variable
score. The SIB responses of the student were then scored on four
keys (see Table 1), each with a total of five subkeys of varying
length. Thus 20 scores were obtained for each student. All scoring
was done by a computer and at the same time, data was
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accumulated so that point biserial key-criterion correlations for 
each of the 110 subkeys could be calculated. The trend in these
point biserials across the five subkeys scored for a given key was 
used to identify optimum key length for that key. Only this key 
length was used in subsequent scoring. 
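A point biserial key-criterion correlation can be computed as sketched below. The function name and data are hypothetical; the rule for reading the "trend" across the five subkeys is not specified in the text, so only the correlation itself is shown.

```python
from math import sqrt
from statistics import mean, pstdev


def point_biserial(scores, attended):
    """Point biserial correlation between key scores and a 0/1 criterion."""
    ones = [s for s, g in zip(scores, attended) if g]
    zeros = [s for s, g in zip(scores, attended) if not g]
    p = len(ones) / len(scores)  # proportion in the attend group
    return (mean(ones) - mean(zeros)) / pstdev(scores) * sqrt(p * (1 - p))


# Hypothetical key scores for four students, two of whom attended college.
r_pb = point_biserial([1, 2, 3, 4], [0, 0, 1, 1])
print(round(r_pb, 4))
```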

Group 3 analyses. Analyses on this group had three primary
purposes: (a) to determine the optimum number of moderated 
scoring keys appropriate to the data; (b) to develop a common 
score scale on which key results could be reported; and (c) to 
develop equations for use in predicting college attendance versus 
nonattendance for students in Group 4, the cross-validation group. 
The records of 3,393 students were randomly assigned to Group 
3 analyses. 

In order to determine the optimum number of moderated scor- 
ing keys, a method for comparing the effectiveness of different 
keys had to be devised. This was accomplished by scoring each 
higher order key combination on the subsamples appropriate to the 
next lower order combination. Thus, the general key (key 22) was 
scored on the ability groups appropriate to keys 19, 20, and 21. If 
the development of moderated scoring keys is of no particular 
value, then key 22 should correlate as well with the criterion in the 
groups appropriate to keys 19, 20, and 21 as do the keys them- 
selves. At the same time, the effect of criterion group split on the 
point biserial correlation coefficient is controlled because all point 
biserial comparisons are made within the same group. 

Using this principle, three decision rules were formulated for
determining the optimum number of scoring keys (Prediger,
1968). Application of these rules provides a flexible as well as
workable procedure for identifying the optimum number and
combination of moderated scoring keys. For example, it would be
possible for the general key to replace all of the moderated keys or
for some combination of keys involving the general key, second
order keys, and ability group keys to be selected. Thus, ability
groups 1-4 might be scored on the general key, groups 5-8 on
key 20, groups 9-10 on key 17, and groups 11 and 12 on their own
moderated scoring keys. This is an example of a case where
moderated scoring keys are unimportant for low ability groups
but become more and more important as ability increases.

Once the optimum set of keys was determined, a common score
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scale had to be developed along with equations for predicting at- 
tendance versus nonattendance. Both of these problems were solved 
through use of the classification procedure described by Cooley 
and Lohnes (1962). This procedure gives a probability index of 
an individual’s similarity to one or more groups on the basis of 
scores available for the individual. In this case there were two 
groups, college attenders and nonattenders. The estimates of at- 
tendance group similarity formed a common score scale to which 
the moderated keys could be anchored. 

Group 4 analyses. Analyses conducted on this group were for 
the purpose of obtaining information bearing directly upon the 
general and subsidiary hypotheses. The records of 3,404 students 
were randomly assigned to these analyses. Scores for each student 
were obtained on the moderator variable, the general key, and the 
moderated scoring key appropriate to his ability level if one were 
found that performed better than the general key. Prediction equa- 
tions developed in Group 3 were then applied to these scores in 
order to obtain probability estimates of group similarity. In addi- 
tion, a probability estimate based on the first two variables in 
optimally weighted combination was obtained for each student. 
Whenever the probability associated with a given estimate ex- 
ceeded .50, “attend” was predicted. Otherwise, “nonattend” was 
predicted. Records of hits and misses were accumulated for each 
of the four predictors. The associated phi coefficients were also 
calculated. In addition, joint hit-miss data was obtained for all 
possible predictor pairs. In this way it was possible to determine
the number of cases for which one predictor hit when the other
missed. Of two predictors, the one that did the best job of
“correcting” the other's misses would be superior. A statistical test
described by McNemar (1962) was applied to the joint hit-miss 
data in order to test each of the subsidiary hypotheses. 


Results 


Item analysis data obtained from Group 1 were used to form
scoring keys and to determine optimum key length on Group 2.
The optimum key lengths were then used in all subsequent scoring.
For the 22 keys involved, five key-length cutoffs were set at a phi
of .10, four at .07, nine at .05, and four at .03. Key length varied
from 76 to 815 item responses. Point biserial correlations with the
criterion are shown in Table 2 for optimum key lengths.

TABLE 2
Relationship between Biodata and Criterion for Optimum Key Length

              Sample size
Key      Attend     Nonattend     r pt bis
1        32         258           .18
2        45         220           .27
3        59         185           .35
4        96         187           .45
5        117        168           .47
6        155        144           .39
7        147        136           .41
8        154        108           .48
9        206        80            .36
10       229        78            .39
11       230        55            .33
12       261        37            .32
13       77         478           .21
14       155        372           [illegible]
15       272        312           .43
16       301        244           .46
17       435        158           .38
18       491        92            .35
19       232        850           .40
20       573        556           [illegible]
21       926        250           .40
22       1731       1656          .58

Note.—Analyses done on Group 2.

Primary objectives. Application of the decision rules for
determining the optimum number and combination of scoring keys
resulted in the general key being retained for scoring at eight of the
twelve ability levels. Keys 17 and 18, both first order combined
keys, were retained at ability levels 9-10 and 11-12, respectively.
Thus, it would appear that ability serves as a moderator variable
only for the more able students. Even for this group, the effect is
not great.

When applied to real data, the procedures and rules developed
for determining the optimum number of moderated scoring keys
worked well. It would appear that both can be readily applied to
other moderator variables, continuous or categorical, if sample
sizes are substantial. The common score scale would also appear to
have general applicability. In principle, criterion predictions were
used as a common basis for anchoring the various moderated scoring
key score distributions. This procedure should work equally
well whether criterion group membership or criterion scores are
being predicted. In the latter case, regression rather than classification
procedures could be used to form the common score scale.
Whenever data are available on the actual predictive validity of an
instrument, a common scale for reporting moderated scoring key
results can be developed. Finally, all of the steps involved in
developing and validating moderated scoring keys can be economically
performed by computers. Approximately seven hours were
required to run the specially written FORTRAN programs on an
IBM 360-44.

Secondary objectives. Data showing the relationship between
predicted and actual status on the criterion are summarized in
Table 3 for each of the four predictors. It is immediately apparent
from the hit rates and phi values that the use of moderated keys
holds no particular advantage over the use of the general key in
this particular situation. The phi coefficient (.54) and hit rate
(77%) for the moderated keys were both only slightly higher than
the phi coefficient (.52) and hit rate (76%) obtained when the
general key was used alone. Thus, academic aptitude does not
appear to be an effective moderator variable for the predictor and
criterion under study.

TABLE 3
Comparison of Accuracy of 4 Predictors in the Prediction
of College Attendance for Group 4

                         Predicted status
Actual status      Nonattend       Attend          Hit rate    phi    Chi-square
Moderated keys
  Attend           386 (11%)       1343 (39%)
  Nonattend        1273 (37%)      402 (12%)       77%         .54    981.1
General key
  Attend           417 (12%)       1312 (38%)
  Nonattend        1278 (38%)      397 (12%)       76%         .52    926.6
Moderator variable
  Attend           372 (11%)       1357 (40%)
  Nonattend        1056 (31%)      619 (18%)       71%         .42    602.6
Best combination of general key and moderator variable
  Attend           379 (11%)       1350 (40%)
  Nonattend        1264 (37%)      411 (12%)       [illegible]

Note.—Per cent of total N given in parentheses is rounded. Hit rate represents separate
rounding of totals.
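The hit rate and phi reported for the moderated keys can be recovered from the four cell counts in Table 3. The function below is a sketch with hypothetical argument names.

```python
from math import sqrt


def hit_rate_and_phi(att_pred_non, att_pred_att, non_pred_non, non_pred_att):
    """Hit rate and phi for a 2 x 2 table of actual by predicted status.

    att_* are actual attenders, non_* are actual nonattenders;
    *_pred_att / *_pred_non give the predicted status.
    """
    n = att_pred_non + att_pred_att + non_pred_non + non_pred_att
    hits = att_pred_att + non_pred_non
    numerator = att_pred_att * non_pred_non - att_pred_non * non_pred_att
    denominator = sqrt(
        (att_pred_non + att_pred_att) * (non_pred_non + non_pred_att)
        * (att_pred_att + non_pred_att) * (att_pred_non + non_pred_non)
    )
    return hits / n, numerator / denominator


# Cell counts for the moderated keys.
rate, phi = hit_rate_and_phi(386, 1343, 1273, 402)
print(round(rate, 2), round(phi, 2))
```

The rounded values agree with the 77% hit rate and phi of .54 given for the moderated keys.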

The same data are presented in a different form in Table 4. Here 
joint hit-miss rates are shown. The most interesting aspects of this 
data are the instances in which one predictor made a correct pre- 
diction (a hit) while the other yielded a miss. For example, the 
moderated keys correctly predicted 135 of the general key misses 
while the general key correctly predicted 109 of the moderated 
key misses. Although this difference in favor of the moderated key 
is statistically significant at the five per cent level (one-tailed test), 
it is certainly not great. On the other hand, the difference in favor 
of the moderated keys is substantial when compared with the mod- 
erator variable. Finally, it can be seen that there is very little 
difference between the joint hit-miss rate for the moderated keys 
and the general key used in combination with the moderator vari- 
able. 
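The comparison of the moderated keys with the general key rests on the two discordant counts, 135 and 109. A sketch of the test for nonindependent proportions in its z form follows; the function name is hypothetical and no continuity correction is applied, which is an assumption about the exact form used.

```python
from math import erfc, sqrt


def mcnemar_z(b, c):
    """z statistic for two correlated proportions from the discordant counts."""
    return (b - c) / sqrt(b + c)


# 135: moderated keys hit where the general key missed; 109: the reverse.
z = mcnemar_z(135, 109)
p_one_tailed = 0.5 * erfc(z / sqrt(2))  # upper-tail normal probability
print(round(z, 2), round(p_one_tailed, 3))
```

The one-tailed probability falls just under .05, in line with the statement that the difference is significant at the five per cent level.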

On the basis of these results, it can be concluded that subsidiary
hypotheses S1 and S2 are not supported. Thus the general
hypothesis relevant to the secondary objective of the project is also not
supported. While subsidiary hypothesis S3 is supported, academic
aptitude does not appear to be an effective moderator variable for
the data under study.


TABLE 4
Joint Hit, Miss Rates for Moderated Keys Versus Other Three Predictors

Moderated key      Miss            Hit             z
General key
  Hit              135 (4%)        2481 (73%)
  Miss             679 (20%)       109 (3%)        [illegible]
Moderator variable
  [entries illegible]
Best combination of general key and moderator variable
  [entries illegible]

Note.—z test for nonindependent proportions performed according to McNemar (1962, pp. 52-56).




DALE J. PREDIGER 
Discussion 

One of the most interesting aspects of the results presented in 
Tables 3 and 4 is the level of accuracy achieved by the predictions. 
A 77 per cent hit rate in the prediction of gross and arbitrarily 
defined criteria such as college attendance is seldom found, espe- 
cially with a base rate similar to the one in the sample under study. 
While the level of relationship expressed by a phi of .54 may not 
seem high, one must remember that the phi coefficient is smaller 
than the Pearson product moment correlation coefficient in situa- 
tions where it is possible to compute both. 
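
As a rough check on these figures, the sketch below recomputes a hit rate and a phi coefficient from the four counts that remain legible in the table fragment for the best combination of general key and moderator variable (379, 1350, 1264, 411). The cell layout is an assumption: 1350 and 1264 are treated as the two kinds of hit and 379 and 411 as the two kinds of miss. The function names are illustrative only.

```python
import math


def phi_coefficient(a, b, c, d):
    """Phi for the 2 x 2 table [[a, b], [c, d]]."""
    numerator = a * d - b * c
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return numerator / denominator


def hit_rate(a, b, c, d):
    """Proportion of cases falling on the main diagonal (the hits)."""
    return (a + d) / (a + b + c + d)


# Assumed layout: a and d are hits, b and c are misses.
a, b = 1350, 379
c, d = 411, 1264

phi = phi_coefficient(a, b, c, d)
rate = hit_rate(a, b, c, d)
print(f"phi = {phi:.2f}, hit rate = {rate:.0%}")
```

With this layout the results round to the .54 and 77 per cent cited in the discussion, which lends some support to the assumed arrangement of cells.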

Equally surprising, in light of previous research, is the relative 
performance of biodata and academic aptitude as predictors of 
college entrance. In a reversal of their usual roles, academic apti- 
tude added little to the predictive accuracy obtained through use 
of biodata alone. Finally, the point biserial correlation (.60) be-
tween biodata and the criterion was greater than that achieved
(.57) in a highly similar sample by use of a multiple regression
equation combining 51 cognitive variables (Flanagan et al., 1964).
Hence, biodata would appear to have considerable promise as a
predictor of college attendance.

It is important to understand that the scoring techniques de- 
veloped in this study exist apart from the data to which they are 
applied. While it may be that the techniques will be of no practical 
value in any setting, such a conclusion is not warranted on the 
basis of the results obtained in this study. Similar, though less elab-
orate, techniques have been shown to work in a different setting
(Prediger, 1966). It well may be that moderator variables are 
situation-specific as Ghiselli and Sanders (1967) suggest. 
THERE is a growing literature which shows that volunteers for
participation in behavioral research differ from nonvolunteers on
several important dimensions.* Volunteers appear to be better edu-
cated and to occupy higher status positions than nonvolunteers.
Volunteers also seem to be brighter, higher in approval need, less
authoritarian, more sociable, more arousal-seeking, more uncon-
ventional, more often firstborn, and younger than nonvolunteers.
Men seem to volunteer more for unconventional studies; women,
for conventional studies. Volunteers for survey research seem to
be better adjusted than nonvolunteers, while in medical research 
the relationship between psychological adjustment and volunteer- 
ing may be reversed; in psychological research the relationship is 
equivocal. 

One implication of these differences is that there may be severe 
restrictions on the generality of findings from studies employing 
only volunteer subjects. Kinsey’s survey research findings on hu- 
man sexual behavior illustrate this point (Kinsey, Pomeroy, and 
Martin, 1948; Kinsey, Pomeroy, Martin, and Gebhard, 1953). Kin- 
sey’s subjects, by virtue of their willingness to be interviewed, may 
have shared other characteristics which both differentiated them 
from people who were unwilling to be interviewed and also seri- 
ously biased the research findings. It was discovered that people 
who are high in self-esteem tend to have somewhat unconven- 
tional sexual attitudes and behavior, and that those who are apt to 
volunteer for a Kinsey interview are likely to be higher in self- 
esteem than people who would refuse to be interviewed (Maslow, 
1942; Maslow and Sakoda, 1952). The implication is that Kinsey’s 
interviewees may have been different from the rest of the popula- 
tion on the very dimension that the investigators were intent on 
studying (cf. Siegman, 1956). 

The potential biasing effect of using volunteer subjects is im-
plied in other nonexperimental research as well. If volunteers are
indeed brighter than nonvolunteers (Martin and Marcuse, 1957,
1958; Reuss, 1943), then it is plausible that standardizing an IQ
test on a sample of volunteers would produce artificially inflated
norms. By the same token, nonrepresentativeness might result from
using volunteer subjects in the standardization of a test of approval
need, or tests of sociability or conventionality, since there is evi-
dence that volunteers and nonvolunteers may differ on each of
these dimensions.

A more intriguing possibility is that experimental effects might 
also be influenced by subjects’ volunteer status. Suppose that we 
recruited volunteer subjects for the purpose of testing the effect 
of an experimental manipulation on the dependent variable of
gregariousness. Since volunteers may be higher in sociability than
nonvolunteers (London, Cooper, and Johnson, 1962; Martin and
Marcuse, 1957, 1958; Poor, 1967; Schubert, 1964), our manipula-
tion, which was designed to increase gregariousness, might be too
harshly judged as ineffective because the untreated control group 
would already be unusually high on this factor. In principle, then, 
the use of volunteer subjects could lead to an increase in Type II 
errors. 

However, it is also possible to conceive of other situations re- 
sulting in the opposite type of error. Suppose that we were in- 
terested in finding out how persuasive a propaganda appeal was 
before using it in the field, and that we decided to assess its effec- 
tiveness in a pilot study employing a sample of volunteer subjects. 
Theoretically at least, the fact that the subjects were volunteers 
might lead us to overestimate its persuasiveness, because people 
who are high in the need for social approval tend to be more in- 
fluenceable than those low in this trait (Crowne and Marlowe, 
1964), and because volunteers may be higher in approval need 
than nonvolunteers (e.g., Hood and Back, 1967; Leipold and James, 
1962; McDavid, 1965; Poor, 1967). Therefore, if a sample of volun- 
teers who had been exposed to the appeal were compared with an 
untreated control group, the magnitude of the differences be- 
tween the groups would be overestimated if, as implied by these 
findings, the experimental group overreacted to the propaganda. 

In the absence of relevant empirical data, the validity of such 
speculations as these, which are based on the apparent implications 
of presumably reliable differentiating characteristics, is of course 
problematic. It was for this reason that an earlier study was con- 
ducted, to gather data bearing on the fundamental question of 
whether or not volunteer status actually has any effect on experi-
mental outcomes (Rosnow and Rosenthal, 1966). The reactions
of female volunteers and nonvolunteers for an unrelated percep-
tion study were compared in a standard opinion change experi-
ment. The experimenter, who was rated by another sample of sub-
jects from the same population as having moderately anti-fraternity
views, read to the research subjects a one-sided pro-fraternity or
anti-fraternity persuasive communication. The data, which were
analyzed relative to a noncommunication control group, disclosed
that volunteers for the unrelated perception study were more
responsive than nonvolunteers to the anti-fraternity communica-
tion and that volunteers were less responsive than nonvolunteers
to the pro-fraternity communication. Reasoning that the con com-
munication would have been congruent with the subjects’ percep-
tions of the experimenter’s anti-fraternity views and the pro
communication incongruent, it could be hypothesized that the volun- 
teers were more sensitive and accommodating than the non- 
volunteers to what they may have perceived to be the dominant 
demand characteristics of the experimental situation.* In this case, 
in which the pro communication may have aroused conflicting de- 
mands and the con communication complementary demands, it 
would appear that the volunteers were more acquiescent to the less 
obvious demands implying the experimenter’s wishes than to the 
more blatant demands inherent in the communications. It was also 
discovered that the volunteers’ opinion changes from pretest to 
posttest showed less predictability than did the nonvolunteers’ 
opinion changes. This significantly greater unreliability may re- 
flect the volunteer subjects’ greater willingness to be influenced 
in whatever direction they felt was demanded by the situation. 

The present research was designed to explore further the rela- 
tionship between demand characteristics and volunteer bias. 


Experiment I 
Method 


Rationale and hypotheses. Adopting the same basic design that 
was used by Rosnow and Rosenthal (1966), the reactions to com- 
munications by volunteers and nonvolunteers for an unrelated study 
were compared in a standard impression formation experiment. It
was assumed that the experimenter's private views on controversial 
issues, which may have been a source of conflicting demands in 


* ... sentient and active beings who are usually concerned ...
perceptions of its purpose may affect their behavior. For this reason, the cues
... perceptions of their role and of the experimenter's
... be important determinants of their behavior in the experiment.
... refers to the sum total of these cues as the demand characteristics
of the experimental situation. The results of a series of methodological
... of these plausible antecedents of systematic error suggest that the
... source of demand characteristics may be the procedure itself
... the context of such other information as the setting, instructions,
rumors about the purpose of the experiment, and subjects’ impressions of the
experimenter (Orne, 1969). For a discussion of ... sources of
systematic error in educational research see ...

the earlier study, would now constitute a logically irrelevant factor.
To achieve this effect, standard one-sided personality-impression
communications modeled after those used by Luchins (1957) were
substituted for the one-sided persuasive communications that served 
as stimuli in the preceding study. Reasoning that demands asso- 
ciated with the experimenter's wishes and expectations would 
now be perceived as congruent with demands inherent in the com- 
munications, it was posited that the experimental hypotheses would 
appear to be more obvious and straightforward here than in the 
preceding study. Based on the relationships between approval need, 
volunteering, and influenceability noted earlier, it could thus be 
hypothesized that volunteer subjects would show a consistently 
greater degree of acquiescence than nonvolunteers in whatever 
direction was implied by the stimuli. 

Procedure. Four introductory sociology classes at Boston Uni-
versity provided the 103 male and 160 female subjects. Following
a recruitment procedure similar to that used by Rosnow and Ros-
enthal (1966), all of the students were invited by their instructors
to volunteer for either or both of two fictitious psychology experi-
ments, one represented as being on psychoacoustics and the other
on group behavior. Out of 263 students, 10 men and 43 women 
volunteered for at least one experiment. One week later the entire 
sample was divided at random into five groups, and all of the sub- 
jects—volunteer and nonvolunteer alike—were presented the fol- 
lowing brief introductory passage to read about a fictitious char- 
acter named “Jim,” patterned after the character developed by 
Luchins (1957) for his research on impression formation: 


In everyday life we sometimes form impressions of people 
based on what we read or hear about them.
On a given school day Jim walks down the street, sees a girl 
he knows, buys some stationery, stops at the candy store. 


For one group of subjects the introductory passage was immedi- 
ately succeeded by the following one-sided, positively-slanted com- 
munication (P), intended to portray “Jim” as friendly and outgo- 
ing: 
Jim left the house to get some stationery. He walked out into
the sun-filled street with two of his friends, basking in the
sun as he walked. Jim entered the stationery store which was
full of people. Jim talked with an acquaintance while he
waited for the clerk to catch his eye. On his way out, he
stopped to chat with a school friend who was just coming into 
the store. Leaving the store, he walked toward school. On his 
way out he met the girl to whom he had been introduced the 
night before. They talked for a short while, and then Jim 
left for school. 


For another group the introductory passage was succeeded by the 
following one-sided, negatively-slanted communication (N), in- 
tended to picture “Jim” as shy and unfriendly: 


After school Jim left the classroom alone. Leaving the school, 
he started on his long walk home. The street was brilliantly 
filled with sunshine. Jim walked down the street on the shady 
side. Coming down the street toward him, he saw the pretty 
girl whom he had met on the previous evening. Jim crossed 
the street, and entered a candy store. The store was crowded 
with students, and he noticed a few familiar faces. Jim waited 
quietly until the counterman caught his eye and then gave his 
order. Taking his drink, he sat down at a side table. When he 
had finished his drink he went home. 


The reactions of those subjects exposed to the one-sided communi- 
cations above would be used to test the acquiescence hypothesis. A 
third group, which served as a noncommunication control, received 
just the introductory passage. The two remaining groups read the 
positively-slanted and negatively-slanted communications in juxta- 
posed order (PN or NP). All of the subjects then rated “Jim” 
on four 9-point bipolar scales, indicating the degree to which he 
seemed friendly or unfriendly, forward or shy, social or unsocial,
aggressive or passive. 


Results 


Because of the very unequal numbers of subjects within sub- 
groups, all analyses of each of the four ratings made by subjects 
were based on unweighted means. The analyses of variance of the 
five treatments by volunteer status by sex of subject showed only 
significant effects of treatments. For each of the four ratings ana- 
lyzed in turn, ps for treatment were less than .001. In these overall 
analyses no other ps for treatment were less than .001. 
Our greatest interest, however, was in the interaction of volun-
teer status with treatments. This interaction was computed for
each of the four dependent variables, and only two of the associated 
ps were less than .20. For the variable “friendly,” p was .12; for 
"sociable," .17. With so many treatment conditions, however, the 
F values for unordered means are relatively insensitive, and it 
may be instructive to examine separately for volunteers and non- 
volunteers the specific experimental effects in which we were most 
interested. 

One-sided communications. Table 1 shows for volunteers and 
nonvolunteers the difference between the control group mean and 
the mean of each of the four experimental groups in turn. Each of
the entries in Table 1 is based on data from both male and female 
subjects combined without weighting. It is apparent that the two 
one-sided communications were most effective among all subjects. 
Although these effects were not significantly greater among volun- 
teers than among nonvolunteers, the trend was in that direction. 


TABLE 1


Effectiveness of One-Sided and Two-Sided Communications
among Volunteers and Nonvolunteers


Treatment minus control       Dependent variable    Volunteers    Nonvolunteers

Positive (P) minus control    Friendliness          +2.18**       +1.78**
                              Forwardness           +1.45**       +0.42
                              Sociability           +1.92**       +1.30*
                              Aggressiveness        +1.15*        +0.20
Negative (N) minus control    Friendliness          -1.55**       -0.70
                              Forwardness           -2.04**       -1.68**
                              Sociability           -2.22**       -1.91
                              Aggressiveness        -1.36**       -1.56
PN minus control              Friendliness          -0.58         ...
                              Forwardness           -0.34         -0. ...
                              Sociability           -1.08*        +0.2...
                              Aggressiveness        -0.34         -0.34
NP minus control              Friendliness          -0.30         ...
                              Forwardness           +0.04         -0. ...
                              Sociability           -0.30         +0.06
                              Aggressiveness        -0.45         -0.52


Note. Scores could range from -4.00 (unfriendly, shy, unsocial, passive) to +4.00 (friendly,
forward, social, aggressive). Positive values indicate ...
Of the eight tests for significance of the effectiveness of the one-
sided communications, all eight reached the .05 level among volun-
teers, while only five of the eight tests reached the .05 level among
nonvolunteers. One implication of this finding is that an investi-
gator employing similar sample sizes to test the effectiveness of
communications such as these, and a strict decision model of in- 
ference with an alpha of either .05 or .01, would arrive at differ- 
ent conclusions over one-third of the time were he to employ 
volunteer rather than nonvolunteer subjects. 
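
The "over one-third" figure follows directly from the counts just given. The short sketch below works it out for the .05 level only, using the counts stated in the text; the variable names are illustrative.

```python
from fractions import Fraction

n_tests = 8                 # tests of the one-sided communications
significant_volunteers = 8  # all eight reached the .05 level
significant_nonvolunteers = 5

# Every test was significant among volunteers, so the two samples lead to
# different decisions exactly on the tests that failed among nonvolunteers.
disagreements = significant_volunteers - significant_nonvolunteers
disagreement_rate = Fraction(disagreements, n_tests)

print(disagreement_rate, float(disagreement_rate))
```

Three of the eight decisions differ, a proportion of .375, which is slightly more than one-third.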

More important perhaps, and consistent with the acquiescence 
hypothesis, is that whenever differences in significance levels oc- 
curred it was the volunteers who tended to favor more the implied 
experimental hypothesis. When the communication was positively- 
slanted, volunteers became more positive than nonvolunteers; when 
the communication was negatively-slanted, volunteers tended to be- 
come more negative than nonvolunteers. These results, then, are 
supportive of the acquiescence hypothesis and lend further cre- 
dence to the notion that the effects of experimental treatments 
may sometimes be a function of subjects’ volunteer status. 

Two-sided communications. The effects of the “two-sided” com- 
munications disclosed less volunteer bias. The two-sided com- 
munications were generally ineffective regardless of whether they 
were compared to the control group (as shown in Table 1) or to 
each other (not shown). A plausible explanation for this finding 
may be that, because the subjects were presented opposing views, 
they were unable confidently to detect demand characteristics that 
might differentially have influenced their reaction to either side 
(cf. Rosnow, 1968). Thus, only one of 16 mean differences indicat-
ing the effectiveness of the two-sided communications was signifi-
cant at the .05 level, about what might be produced by chance.
Nevertheless, that one “effect” did occur among volunteers. 
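
The appeal to chance can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that the 16 tests are independent and that no true effect exists; neither assumption is stated in the text, and the ratings within a group are unlikely to be fully independent.

```python
alpha = 0.05
n_tests = 16

# Expected number of "significant" results if every null hypothesis is true.
expected_false_positives = alpha * n_tests

# Probability of at least one such result, assuming independent tests.
p_at_least_one = 1 - (1 - alpha) ** n_tests

print(f"expected = {expected_false_positives:.1f}, "
      f"P(at least one) = {p_at_least_one:.2f}")
```

Under these assumptions slightly fewer than one significant difference is expected, and at least one would occur more often than not, so a single significant result out of 16 is unremarkable.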


Experiment II 
A follow-up study was conducted utilizing a deception proce- 
dure to investigate further the relationship between obviousness of 
demand characteristics and volunteer bias. Our expectation in this 
exploratory study was that the deception might function in a simi-
lar manner as theoretically the two-sided communication had in
Experiment I, in effect removing or disguising demand character-
istics that would otherwise favor one direction of response over
another. 


Method 


The deception selected for this purpose was adapted from a 
procedure introduced by Brehm (1956) to test a particular deriva-
tion from Festinger’s (1957, 1964) theory of cognitive dissonance.
Festinger has maintained that one source of cognitive dissonance is
having to commit oneself to one of several specified alternatives.
His position is that commitment to one alternative may automati- 
cally imply the dissolution of prior commitments to inconsistent 
alternatives. As a result the decision-maker must suffer the cogni- 
tive dissonance that follows from having chosen a not totally per- 
fect alternative and thus having rejected other alternatives that 
were not entirely imperfect. Because the ensuing psychological 
discomfort would have the effect of motivating him to reduce his 
dissonance, thereby alleviating his discomfort, he should, according 
to the theory, actively attempt in one or another way to achieve a 
state of consonance.

An interesting way of reducing dissonance in this situation is to 
“spread apart” the values of the choice alternatives. By increasing 
in his own mind the distance between the chosen and the rejected 
alternative(s), the decision-maker should be able to justify his 
irrevocable choice, reduce his cognitive dissonance, and concomi- 
tantly ameliorate his discomfort. The greater the cognitive dis- 
sonance, the greater presumably will he spread apart the alterna- 
tives. To achieve this effect he would increase the perceived 
importance of the chosen alternative and decrease the importance of 
the rejected, or unchosen, alternative(s). How much cognitive dis- 
sonance would actually result in a given situation, and thus the 
amount of spreading apart that should occur, is suggested by 
Festinger, as shown in Figure 1, to be a function of the relative 
attractiveness of the unchosen alternative and the importance of 
the decision. 


... for any given relative attractiveness of the unchosen al-
ternative, the more important the decision or the greater the
attractiveness of the chosen alternative, the greater would be
the resulting dissonance. As the relative attractiveness of the
unchosen alternative decreases, the resulting dissonance also
decreases. (Festinger, 1957, p. 38)


Figure 1. Theoretical gradients for postdecision dissonance reduction. (After
Festinger, 1957, p. 38, by permission of the author and publisher.)


Although other derivations from cognitive dissonance theory have 
been a source of frequent debate in recent years (e.g., Rosnow and 
Robinson, 1957, pp. 300 f., 407 f.), at least in regard to this issue—
the resolution of dissonance after a “free choice” decision—the
literature is comparatively free of controversy (e.g., Insko, 1967,
pp. 206 ff.; Jones and Gerard, 1967, pp. 211 ff.). A number of ex-
periments attest to the reliability of the empirical phenomenon of
the spreading apart of choice alternatives (Allen, 1964; Brehm, 1956;
Brehm and Cohen, 1959; Brock, 1963; Davidson and Kiesler, 1964;
Deutsch, Krauss, and Rosenau, 1962; Festinger and Walster, 1964;
Gerard, Blevans, and Malcolm, 1964; Jecker, 1964; Walster,
1964; Walster, Berscheid, and Barclay, 1967). 

Procedure. The research was carried out in three phases. In the 
first phase undergraduate women at Boston University were re- 
cruited for a fictitious perception experiment. The second phase 
was begun immediately after the recruitment procedure was com- 
pleted. A second experimenter was represented to the students as 
a member of a research team from the University of Wisconsin 
that was conducting surveys at several colleges and universities 
throughout the country. Inserted among irrelevant items in the 
survey questionnaire was one which asked the students to rate 
on an 11-point scale how important they personally perceived
each of the following ideas:


There should be more no-grade courses at universities.

Universities should give up all functions of acting in loco
parentis (in the place of parents).

There should be courses in sex education.

Students should “unionize” to gain a more powerful and more
effective voice in running universities.

All students should be compelled to attend a specified number
of cultural events or places during the school year.

College programs should alternate semesters of study with
“semesters” of working at jobs in the nonacademic community.

There should be at least one semester of study in a non-formal,
nonclassroom setting, involving several subjects and several
instructors.

There should be increased use of teaching machines for “rote
memory” material, to speed up the teaching process.

Universities should bring in people from the business com-
munity to conduct classes.

All students should be required to take education courses.

There should be courses in the use and control of hallucinatory
drugs.

All university-sponsored, university-organized activities and
social functions should be eliminated.


The third phase was carried out one month later. At that time a
third experimenter was represented to the students as an employee
of the Communication Research Center at Boston University. The
students were told that the Center was conducting a survey among 
Boston University students as a follow-up to the research carried 
out by the University of Wisconsin team. Each student was then 
administered a questionnaire individually constructed for her, 
which listed two of the ideas from the “Wisconsin survey,” and she 
was instructed to select one according to the plan outlined below. 
After she had made her selection, the subject re-rated the im- 
portance of the choice alternatives and then rated on a 7-point scale 
the importance of the survey. 

1. Importance manipulation. A condition of high importance 
was experimentally induced by instructing some of the students at 
random that Boston University planned, the following year, to 
institute a new program corresponding to one of the ideas listed 
in the questionnaire. These subjects were further told that their 
preferences would be used to guide the University in selecting 
the new program. This deception was intended to convey the high 
importance of the consequences of the subjects’ decisions for the 
University and for them personally. The remaining subjects, com- 
prising a low importance condition, were instructed that several of 
the ideas listed in the Wisconsin survey had been found to be 
important by students at other universities and colleges, and that 
the two contained in their questionnaire were relevant for Boston 
University students. These subjects then simply indicated which 
idea they thought was most important for students at Boston Uni- 
versity. This deception was intended to convey the fact that the 
consequences of the subjects’ decisions were of no great practical
importance. 

2. Attractiveness manipulation. Reasoning that the greater the
difference in rated importance between the choice alternatives the
less would be the relative attractiveness of the unchosen alternative,
conditions of high and low attractiveness were also manipulated.
Alternatives were selected for the high attractiveness condition
which had previously been rated zero to two units apart, and
alternatives were selected for the low attractiveness condition
which had been rated three or more units apart. The actual mean
difference in ratings in the high attractiveness condition was 0.5 
units, and that in the low attractiveness condition was 3.2 units. 
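
The assignment rule for the attractiveness conditions can be sketched in a few lines. The function below is a hypothetical illustration, not the investigators' procedure; it assumes whole-number ratings on the 11-point scale and uses the cut-offs given above (zero to two units apart for high attractiveness, three or more for low).

```python
def attractiveness_condition(rating_a, rating_b):
    """Classify a pair of ideas by how far apart they were first rated.

    A small gap means the unchosen idea is nearly as attractive as the
    chosen one (high attractiveness); a large gap means it is not (low).
    """
    difference = abs(rating_a - rating_b)
    return "high" if difference <= 2 else "low"


# Example pairs of initial ratings (invented for illustration).
for pair in [(7, 6), (4, 4), (9, 5), (3, 6)]:
    print(pair, attractiveness_condition(*pair))
```

Pairs rated within two units of each other fall in the high attractiveness condition, and all others fall in the low attractiveness condition.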
Table 2 summarizes the results. An unweighted means analysis
of variance computed on the net change data disclosed overall 
effects roughly corresponding to those predicted by cognitive dis- 
sonance theory. As expected, greater spreading apart of the choice 
alternatives resulted from high as opposed to low attractiveness 
of the unchosen alternative (F = 15.59, df = 1/101, p < .001) 
and from high versus low importance (F = 2.64, p = .11). Also as 
expected, no significant difference in net change was revealed be- 
tween volunteer and nonvolunteer subjects, nor were any of the 
interactions for net change between volunteer status and importance 
or attractiveness significant (all Fs < 1).
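
The net change measure analyzed here is the change in the chosen alternative minus the change in the unchosen alternative. The sketch below applies that definition to the two pairs of values that remain legible in Table 2 for volunteer subjects under high importance; the function name is illustrative, and the reading of those table cells is an assumption.

```python
def net_change(chosen_change, unchosen_change):
    """Net spreading apart: change in chosen minus change in unchosen."""
    return chosen_change - unchosen_change


# (change in chosen, change in unchosen), as far as Table 2 can be read.
legible_rows = [(0.21, -0.71), (0.59, -1.64)]

for chosen, unchosen in legible_rows:
    print(f"{chosen:+.2f} - ({unchosen:+.2f}) = "
          f"{net_change(chosen, unchosen):+.2f}")
```

A positive result means the two alternatives moved apart after the choice, which is the pattern the theory treats as dissonance reduction.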

However, while the net change data in Table 2 are strikingly 
consistent from volunteers to nonvolunteers, the data for chosen 
and unchosen alternatives suggest differential changing between 
volunteers and nonvolunteers. Consistent with the latter observa- 
tion was the finding of a significant fourth order interaction in a 


TABLE 2


Reduction of Postdecision Dissonance by Changing the Importance of
the Chosen and Unchosen Alternatives


Experimental variable          n      Chosen     Unchosen    Net change


Volunteer subjects

  Low importance
    Low attractiveness         ...    +0.09      ...         ...
    High attractiveness        13     +0.31      ...         ...
  High importance
    Low attractiveness         ...    +0.21      -0.71       +0.92
    High attractiveness        14     +0.59      -1.64       +2.23*


Nonvolunteer subjects

  Low importance
    Low attractiveness         13     -0.54      ...         ...
    High attractiveness        15     +1.23      -1. ...     ...
  High importance
    Low attractiveness         ...    ...        ...         ...
    High attractiveness        14     +2. ...    ...         ...


Note.—Positive values indicate an increase in importance; negative values indicate a de-
crease. Net change is the change in the chosen alternative minus the change in the unchosen,
which indicates the net “spreading apart” following the choice. A positive
net change value is evidence of dissonance reduction.

* Net change greater than zero at p < .01, two-tailed test.
volunteer status by importance by attractiveness by repeated mea-
sures analysis of variance of these differential change data (F =
48.58, df = 1/101, p < .0001), though no other interaction with 
volunteer status was significant at even the .10 level. Table 2 re- 
veals, especially in the case of attractiveness, but also for impor- 
tance, that there was less effect of these variables on volunteers 
than nonvolunteers in changing ratings of the chosen alternative.
For the unchosen alternative, however, volunteers show a some- 
what greater effect of attractiveness than nonvolunteers. Since 
dissonance researchers may not always look separately at the alter- 
natives, but only at the net change data, the implication of this 
interaction is that an overall decision about the success of the 
manipulation may be only indirectly affected by subjects’ volun- 
teer status. 

Check on manipulated importance. To provide the customary 
internal check on the success of the manipulation, subjects in the 
high and low importance conditions rated on a 7-point scale how 
important they perceived the survey. The results, which are shown 
in Table 3, reveal only a negligible difference between conditions of 
high and low importance (F < 1), a finding which may not be in-
consistent with the fact that the experimental effect of these varia-
bles was also of only borderline significance. More interesting, how-
ever, is the finding in Table 3 that volunteers tended, somewhat
surprisingly, to rate the survey slightly lower in importance than 
did nonvolunteers (F = 2.11, df = 1/102, p < .16). Also, volun- 
teer status showed a tendency to interact with the experimental 
manipulation of importance (F = 3.89, p < .06): volunteers rated 
the low importance condition slightly higher in importance than 
did nonvolunteers, but volunteers rated the high importance condi- 


TABLE 3
Importance of the Survey

Experimental condition      Volunteers      Nonvolunteers
Low importance              4.83            4.70
                            (n = 23)        (n = ...)
High importance             4.43            ...
                            (n = 35)        (n = 21)


Note.—Values could range from 1, extremely unimportant, to 7, extremely important, a
score of 4 indicating neither important nor unimportant.
* Response of one subject is missing.
tion slightly lower in importance than did nonvolunteers. This
intriguing reversal seems to run counter to our earlier hypothesis 
that volunteers are more sensitive and accommodating than non- 
volunteers to demand characteristics. It can be argued, however, 
that the volunteers’ ratings may reflect a kind of “boomerang” 
effect in response to the blatancy of demands in this case. Per- 
haps the experimenter's asking subjects to rate the importance of 
her study simply aroused their suspicions concerning her “real” 
intent and whether her own hypothesis might not be more subtle 
than it appeared. That greater suspicion of deception could have 
been operating among volunteers than nonvolunteers may be im- 
plied in Stricker, Messick, and Jackson’s (1967) finding that 
suspiciousness of purpose tends to correlate positively with social 
desirability (approval need) and with intelligence, both of which 
characteristics, we noted earlier, appear themselves to correlate 
positively with volunteering. 


Discussion of the Findings 


The hypothesis was developed earlier that volunteers may be 
more sensitive and accommodating than nonvolunteers to de- 
mand characteristics. Our findings in these two exploratory studies 
raise a question as to whether that relationship may not be medi- 
ated to some extent by the obviousness of the demand character- 
istics operating in a given situation. By “obviousness” we mean the 
degree of recognizability of demand characteristics, a factor which 
is conceived here as being a function of the simplicity,
consistency, and blatancy of cues pertinent to the experimental
hypotheses. The relationship that seems to emerge from these findings
is that at one end of the obviousness continuum, where cues are
simple, internally consistent, and not patently obtrusive, subjects’
confidence in their speculations about the experimental hypotheses
would presumably be high, and volunteers, because of their
higher approval need and consequently greater influenceability,
would be more accommodating than nonvolunteers to the
straightforward demands operating in such a situation. Thus, it was
found in Experiment I that volunteers became more positive than
nonvolunteers after a simple positively-slanted communication and
more negative than nonvolunteers after a negatively-slanted
communication. Toward the other end of the continuum, where
cues embedded in a complicated deception procedure might
themselves appear highly complex and confusing, or where cues
associated with a bi-directional stimulus might seem inconsistent and
even contradictory, or where cues might seem too patently ob- 
trusive and the experimenter’s expectations suspiciously blatant, 
subjects’ confidence in their speculations may be lower, and only 
negligible differences, or else very subtle differences, may result 
between volunteers and nonvolunteers. Thus, when a deception 
procedure was used in Experiment II to mask demand characteristics,
only negligible net differences in the spreading apart of
alternatives resulted between volunteers and nonvolunteers. Simi- 
larly, negligible differences resulted between volunteers and non- 
volunteers in Experiment I when the bi-directional stimulus was a 
two-sided personality-impression communication, and subtle, but 
rather weak, differences were revealed in regard to a check on the 
manipulation of importance in Experiment II. The situation in 
Rosnow and Rosenthal (1966) might be conceived as falling some- 
where around the middle of the continuum, volunteers possibly 
perceiving the dominant demands as being highly obvious, and
nonvolunteers, who were perhaps confused by what may have
appeared to them to be simultaneously conflicting demands,
perceiving the overall situation as being less obvious. The average
of their collective perceptions might thus fall somewhere in the
middle range.


Check on These Assumptions 


Because these assumptions are entirely speculative, an attempt
was made to gather data relevant to them. A procedure was used
for this purpose, which was in principle roughly like the
“quasi-control” procedures recently outlined by Orne.⁵ During a
regularly scheduled class meeting, 29 undergraduate women at Temple
University, who were unfamiliar with this research, were
administered a booklet containing case studies of five experimental
situations. (Although we attempted to manipulate volunteer status,
unfortunately none of the raters volunteered for an unrelated
psychological experiment.) After each case, the subject was asked
to speculate on what the experimenter hoped to find and to
estimate on a 101-point scale (a) how obvious the experimenter's
expectations seemed to her and (b) how much confidence she had in
her speculations. In each instance the subjects were provided with
materials identical to those that had been used in the actual
research and a full account of the experimental procedure.

⁵ Orne ([year illegible]) has outlined several useful procedures for
uncovering the [illegible] of demand characteristics. All have in
common the fact that subjects, who may be termed “quasi-control
subjects,” reflect upon the context in which a given experiment is
conducted and speculate on ways in which the context might influence
their own and research subjects’ behavior.

The five conditions, in the order in which they appeared in the
booklet, were as follows:

Case I. The subject was asked to imagine herself in a situation 
like that used in Experiment II, where she had to choose between 
two ideas that she had earlier evaluated and then to re-appraise the 
importance of the choice alternatives. To simulate the high impor- 
tance condition of Experiment II, the subject was told to imagine 
that she had been instructed that her choice would guide the admin- 
istration of the University in selecting a new experimental program 
to be instituted. 

Case II. The subject was asked to imagine herself in the same 
situation as Case I, but to assume that the experimenter had now 
asked her to evaluate the overall importance of the study. 

Case III. To simulate one of the conditions in Experiment I, the
subject was asked to imagine that she was to evaluate “Jim” on 
the basis of the one-sided, negatively-slanted information provided 
in that study. 

Case IV. Another condition from Experiment I was simulated by 
asking the subject to imagine herself in a similar situation as in 
Case III, but that now she had to evaluate “Jim” on the basis of 
negatively-slanted and positively-slanted information. 

Case V. To simulate one of the conditions in Rosnow and 
Rosenthal (1966), the subject was asked to imagine that a
communicator, whom she believed to be anti-fraternity, measured her
opinions of college fraternities after he had read to her a
pro-fraternity persuasive communication.

The results, which are shown in Table 4, were treated by the
Newman-Keuls method for determining differences between
repeated measures. The only difference between means that was
significant at the .05 level was that for obviousness of intent
between Case III and all other Cases. With the exception of Case V,
which was expected to fall somewhere in the middle range relative
to the other four cases, these differences for obviousness are
generally consistent with the assumptions, Case III falling toward the
high end of the continuum and Cases I, II, and IV falling toward
the low end. The failure of Case V to correspond to our
expectation could be linked to the fact that the sample was comprised
entirely of nonvolunteers, who perhaps were unable to differentiate
dominant from subordinate demands and thus tended to view the
overall situation in the same way that they may have regarded
Case IV. As for the finding that no differences significant at the
.05 level resulted between mean ratings for confidence, this might
be due to the criterion not having been clearly enough defined or
perhaps to the fact that the sample was comprised of less sensitive,
nonvolunteer subjects. It can be noted, however, that Case III was
rated highest in confidence, which of course is consistent with the
assumptions.


TABLE 4
Ratings for Obviousness of Experimenter’s Expectations and for Raters’
Confidence in Their Speculations

Task        Obviousness    Confidence
Case I      59.31          68.51
Case II     55.86          63.62
Case III    71.89          73.10
Case IV     58.11          60.10
Case V      57.25          63.39

Note. Values could range from zero, indicating that the experimenter's
expectations were not obvious or that the rater had absolutely no
confidence in her speculations, to 100, very obvious or very high
confidence.


Generality of Volunteer Effects 


One other point can be made concerning the generality of the 
findings obtained thus far. In all three exploratory studies com- 
parisons were made only between students who indicated that they 
would participate as research subjects and those who indicated that 
they would not. Obviously not all volunteers show up for experi- 
ments, and there is evidence to suggest that “no-shows” may be 
more like nonvolunteers than they are like the volunteers who 
keep their appointments (Conroy and Morris, 1968; Levitt, Lubin, 
and Brady, 1962). Thus, comparing nonvolunteers with verbal 
volunteers may in fact be a comparison between nonvolunteers
and some other nonvolunteers mixed in unknown proportion with
true volunteers. The point is that differences such as those re- 
ported here between nonvolunteers and verbal volunteers may tend 
to underestimate differences that would actually result between 
volunteers who showed up for an experiment and nonvolunteers. 


Summary and Conclusions 


Two exploratory studies were reported in which the reactions to 
experimental treatments of volunteer and nonvolunteer subjects 
were compared. The overall results were not inconsistent with the 
notion, alluded to by Orne and others, that volunteers may more 
often play the role of the “good subject,” acting in ways which 
would be perceived as leading to confirmation of the experiment- 
er’s hypotheses. 


For the volunteer subject to feel that he has made a useful 
contribution, it is necessary to assume that the experimenter 
is competent and that he himself is a “good subject.” The 
significance to the subject of successfully being a “good sub- 
ject” is attested to by the frequent questions at the conclusion 
of an experiment, to the effect of, “Did I ruin the experiment?” 
What is most commonly meant by this is, “Did I perform well 
in my role as experimental subject?” or “Did my behavior de- 
monstrate that which the experiment is designed to show?” 
(Orne, 1962, p. 778)


Our findings were generally supportive of the following
hypotheses: (a) that the experimental effects of behavioral research
may be a partial function of subjects’ volunteer status and (b) that
volunteers may be more sensitive and accommodating than
nonvolunteers to demand characteristics. Based on these and earlier
findings, it was posited that the latter relationship may be mediated
to some extent by the obviousness of the demand characteristics
operating in any given situation.


EFFECTS OF PROMISED REWARD AND
THREATENED PENALTY ON PERFORMANCE 
OF A MULTIPLE-CHOICE VOCABULARY TEST 


ROSS E. TRAUB, RONALD K. HAMBLETON, AND BALWANT SINGH¹
The Ontario Institute for Studies in Education


In research on multiple-choice tests some attention has been
paid to the effect on test performance of instructions about
guessing. The focus of this attention has been on instructions of two
types. The first encourages the examinee to guess when he does
not know an answer and informs him that his score will be the
number of correct answers. The second encourages the examinee
not to guess by telling him there is a penalty for wrong answers
and that his score will be the number of correct answers minus
some fraction of the number of wrong answers.

Research comparing the effects of these two types of instructions
on test behavior has been reported by Blommers and Lindquist
(1965), Keislar (1953), Michael, Stewart, Douglass, and Rainwater
(1963), Ruch and De Graff (1926), and Swineford and Miller
(1953). [...] number of questions attempted to a level significantly
below the number attempted under instructions to guess unknown
answers. Associated with this result is the observation that the
numbers of correct and incorrect answers are significantly lower when
the instructions include a penalty for wrong answers.²

¹ The authors are grateful to the following principals of Toronto
schools for their [illegible] in the data collection phase of the
study: Mr. N. Allen, St. Clair Junior High School; Mr. W. C. Macready,
Leaside High School; Mr. H. A. McCordie, Westwood Junior High School;
and Mr. D. R. Smith, Oak Park Junior High School.

² As might be expected, instructions which tell a student not to guess
but say nothing about a penalty in scoring have a less marked effect
on test performance. Taylor (1966) found that examinees told not to
guess did not differ in the average number-correct and formula
(corrected-for-guessing) scores from examinees encouraged to guess and
those told nothing about guessing. However, Taylor's results seem to
indicate that the group told not to guess had more omitted questions
and fewer incorrect answers. Also, as might be expected, Waters (1967)
has shown that the effectiveness of penalty instructions depends on
the magnitude of the penalty. It would appear that the number of omits
is a negatively-accelerated, increasing function of the size of the
threatened penalty for wrong answers. Waters’ data also suggest that
the number of correct answers decreases as a negatively-accelerated
function of the size of the penalty. The effect of the magnitude of
the penalty on the number of correct answers appeared to reach an
asymptotic level within the range of penalties studied by Waters (the
range of penalties was from zero to four points per question).
However, an asymptote did not appear to have been reached for the
number of [illegible].

More interesting perhaps are the results concerning reliability.
For scores consisting of the number of correct answers,
Swineford and Miller (1953) found the highest KR-20 estimate of
reliability in the group given the penalty instructions. The lowest
reliability was observed under instructions to answer every
question. Blommers and Lindquist (1965), Keislar (1953) and Ruch
and De Graff (1926) obtained similar results using split-half
estimates of reliability. In the light of Mattson’s (1965) theoretical
study, these reliability results may be interpreted as evidence that
threat of a penalty for wrong answers reduces guessing. This
follows because Mattson showed, within limits, that guessing
attenuates reliability.

The purpose of the present investigation was to study the effect
on test-taking behavior and test reliability of a third type of
instruction. As in the second type, described above, the intention of
this instruction is to encourage examinees not to guess. But instead
of threatening them with the loss of marks for wrong answers, the
instructions promise a small reward for omitting items for which
answers are not known. That is, the examinees are told their scores
will consist of the number of questions answered correctly plus
a fraction of the number omitted.

It was our expectation that the promise of reward would be more
effective than the threat of a penalty in getting examinees to omit.
The argument that can be offered in support of this expectation is
that the instructions promising a reward apply specifically and
directly to the desired behavior, that is, the behavior of omitting
questions when the answer is not known. On the other hand, the
threat of a penalty encourages omissiveness indirectly and only as
a correlated result of getting examinees concerned about reducing
the number of questions they answer incorrectly.
The following empirical questions were asked: Does the promise
of reward result in more omitted questions than the threat of a
penalty? Are these two instructions associated with different
average numbers of correct and incorrect answers and formula
(corrected-for-guessing) scores? Does promise of reward produce
more reliable scores than the threat of a penalty? Do different
instructions about guessing affect the examinees’ test-taking
strategies and their opinions of how well the test assesses their
capabilities? The present investigation was designed to provide
answers to these questions.

For completeness and in order to make the results of this
study more comparable with previous research findings, two
additional instructional conditions were included. One told the
examinee he could guess and that his score would be the number of
correct answers. The other instruction said nothing about guessing
but did tell the examinee that his score would consist of the
number of correct answers.


Method 


The subjects for the study were 667 ninth-grade students (340
girls and 327 boys) from four schools in Metropolitan Toronto,
Ontario.

The students were administered two forms of the Dominion
Vocabulary Test. Each form of this instrument is composed of 90
synonym items in a five-option multiple-choice format. The test
was standardized in Ontario for administration in 20 minutes to
students in grades 9-13. This particular measure of vocabulary was
chosen because information in the manual indicated the test to be of
more than middle difficulty for ninth-grade students. Thus it was
expected that a substantial number of the vocabulary items would
be unknown to ninth-grade children and, as a consequence, would
provide ample opportunity for guessing.

The vocabulary tests were administered in the following way:
[...] test and marking their answers on a machine scorable sheet. Then
the students were asked to read silently the instructions on the
covers of their test books.

Each student read one of four instructions. Which instruction a 
student read was determined by a random process, subject only to 
the restriction that insofar as the situation could be controlled, the 
same number of students received each instruction.³ This was done
by randomizing the instructions in blocks of 32 test books. In each
block, the four instructions each appeared on eight test books. In
the testing room the prerandomized block of 32 books was doled 
out in order to the first 32 students in the room. Then the next 
prerandomized block of 32 was given out and so on. During the 
test administration no mention was made of the fact that there were 
four different instructions and the proctors saw no indication dur- 
ing the testing session of an awareness on the part of the students 
that there were different instructions. The four instructions were 
as follows: 

1. Promise of Reward: You may guess the answers to unfamiliar 
words. If you guess correctly, you will get a mark for 
having the right answer; if you guess incorrectly, you will not 
get a mark for the question. However, if you omit the ques- 
tion, you will get a bonus of one-fifth (1/5) of a mark. Thus 
your score on this test will be the number of correct answers 
plus one-fifth (1/5) the number of questions omitted. 

2. Threat of Penalty: You may guess the answers to unfamiliar 
words. If you guess correctly, you will get a mark for having 
the right answer; if you guess incorrectly, you will be penal- 
ized one-quarter (1/4) of a mark. However, if you omit the 
question, you will not get a mark for the question. Thus your 
score on this test will be the number of correct answers 
minus one-quarter (1/4) the number of incorrect answers. 

3. Guess: You may guess the answers to unfamiliar words. Your
score on this test will be the number of questions answered
correctly.

4. No Reference to Guessing: Your score on this test will be
the number of questions answered correctly.
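
The blocked random assignment of instructions described before the list can be illustrated with a short sketch. It is not the authors' procedure in detail: the seed, the function names, and the use of Python's `random` module are assumptions made only for illustration. It shows how each prerandomized block of 32 test books carries every instruction exactly eight times, while the order within a block is left to chance.

```python
import random
from collections import Counter

# The four instruction types named in the text.
INSTRUCTIONS = [
    "promise of reward",
    "threat of penalty",
    "guess",
    "no reference to guessing",
]

BLOCK_SIZE = 32  # test books per prerandomized block, as in the text


def make_block(rng):
    """Return one shuffled block with each instruction on eight books."""
    copies = BLOCK_SIZE // len(INSTRUCTIONS)
    block = [name for name in INSTRUCTIONS for _ in range(copies)]
    rng.shuffle(block)
    return block


def assign_instructions(n_students, seed=0):
    """Hand out blocks in order until every student has a test book."""
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    books = []
    while len(books) < n_students:
        books.extend(make_block(rng))
    return books[:n_students]


assignment = assign_instructions(667)
first_block = Counter(assignment[:BLOCK_SIZE])
totals = Counter(assignment)
```

Because the last block is only partly used when the number of students is not a multiple of 32, group sizes can differ slightly even before any data are lost.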


³ In spite of this control, the experimental groups did not contain
the same N. This was caused by the failure of some examinees to mark
the answer sheet according to instructions and by the incomplete data
produced by students forced to leave during the [illegible] illness,
or some other commitment.

Instructions 1, 2 and 3 were designed to give the examinee notice
that he could guess if he so decided (a fact that should be obvious
anyway) and to inform him of the consequences of correct and
incorrect guesses. Designed in this way, the three types of
instructions are unambiguous in the sense that the examinee has
enough information to formulate an “optimal” strategy for working
the test provided he knows something about probability theory.
This type of instruction is recommended by Edwards (1961).
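
To make the idea of an “optimal” strategy concrete, the following sketch compares the expected score from guessing with the score from omitting under each of the first three instructions. It assumes a blind guess among five equally attractive options (probability 1/5 of being right), which is an illustrative assumption rather than something the examinees were told; the names `RULES` and `expected_gain_from_guessing` are hypothetical.

```python
from fractions import Fraction

# Marks awarded for a correct answer, a wrong answer, and an omission,
# taken from the wording of Instructions 1-3.
RULES = {
    "promise of reward": {
        "correct": Fraction(1), "wrong": Fraction(0), "omit": Fraction(1, 5),
    },
    "threat of penalty": {
        "correct": Fraction(1), "wrong": Fraction(-1, 4), "omit": Fraction(0),
    },
    "guess": {
        "correct": Fraction(1), "wrong": Fraction(0), "omit": Fraction(0),
    },
}

# Assumption: a blind guess on a five-option item is right 1 time in 5.
BLIND_GUESS = Fraction(1, 5)


def expected_gain_from_guessing(rule, p_correct=BLIND_GUESS):
    """Expected marks from guessing minus the marks earned by omitting."""
    guess_value = p_correct * rule["correct"] + (1 - p_correct) * rule["wrong"]
    return guess_value - rule["omit"]


gains = {name: expected_gain_from_guessing(rule) for name, rule in RULES.items()}

# With one option eliminated, the chance of a correct guess rises to 1/4.
informed_gain_penalty = expected_gain_from_guessing(
    RULES["threat of penalty"], Fraction(1, 4)
)
```

Under these assumptions a blind guess exactly breaks even against omitting for both the reward and the penalty instructions, whereas any partial knowledge tips the balance toward answering.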
When the students had finished reading the instructions they
were given a twenty minute period in which to work the first form
of the vocabulary test. At the end of the period the answer sheets
were collected. Then the next twenty minutes were devoted to
working the second form. It should be noted that for approximately
half the students getting each type of instruction, form A of the
vocabulary test was given first and form B second. For the other
half of the students the order of forms was reversed. When the
second form had been completed, the answer sheets and test books
were collected and the students were given a questionnaire to
complete. The questionnaire was designed to find out if the
students remembered what the instructions had said about guessing, to
find out what approach had been adopted by students in working
the test, and to find out what each student thought of the test as
a measure of his vocabulary. About ten minutes were required to
complete the questionnaire.


Results 


Effects of Instructions on Test Scores


Statistics are reported in Table 1 summarizing how both forms
of the test were performed under each type of instruction. From
the means recorded in Table 1, it is clear that the instructions
differentially affected the average number of correct answers. The
group promised a reward and the one threatened with a penalty
did not differ much between themselves and had, on the average,
nearly three fewer correct answers per test form than the two
other groups who also did not differ greatly between themselves.
Intergroup differences were much larger on the number of
incorrect and omitted items. The biggest differences were between
the group promised a reward for omitted items and the one told
to guess when in doubt; the former group averaged nearly ten
fewer incorrect answers and nearly 13 more omitted questions
than the latter. It should also be noted that for both the number
of incorrect answers and the number of omitted questions, the
differences between the group promised a reward and any one of
the other three groups were larger than were the differences
among the other three groups themselves.


TABLE 1

Means and Standard Deviations of Four Scores Obtained on Two Forms of a
Vocabulary Test Written Under Four Types of Guessing Instructions

                                Forms written in Order AB    Forms written in Order BA
Guessing                          Form A        Form B         Form A        Form B
Instructions        Score         M     SD      M     SD       M     SD      M     SD

Promise of Reward                 (N = 84)                     (N = 85)
                    Correct      37.0  15.3    39.4  13.9     35.8  14.3    37.4  12.9
                    Incorrect    31.9  15.0    35.6  14.7     31.8  15.7    28.9  13.3
                    Omitted      21.1  15.9    15.0  13.3     22.3  17.8    23.7  10.5
                    Formula*     29.0  17.3    30.4  16.3     27.9  15.9    30.3  14.0

Threat of Penalty                 (N = 83)                     (N = 81)
                    Correct      38.7  15.2    39.5  13.5     34.7  14.3    36.8  13.2
                    Incorrect    38.4  15.2    39.4  15.6     42.8  16.9    38.5  15.8
                    Omitted      12.9  13.9    11.1  14.8     12.5  14.6    14.7  14.8
                    Formula*     29.1  17.7    29.6  15.8     24.0  17.0    27.1  15.5

Guess                             (N = 83)                     (N = 89)
                    Correct      39.5  14.5    41.0  13.3     40.3  15.5    42.0  15.0
                    Incorrect    39.6  13.1    42.4  14.0     44.7  16.1    41.2  15.5
                    Omitted      10.9  15.1     6.6  13.6      5.0   9.6     6.8   8.9
                    Formula*     29.6  16.2    30.5  15.4     29.2  18.9    31.7  18.3

No instructions                   (N = 85)                     (N = 77)
                    Correct      39.0  14.7    41.1  14.2     40.2  13.7    41.5  13.1
                    Incorrect    39.8  15.3    42.2  15.7     40.6  14.9    36.8  14.3
                    Omitted      11.2  14.3     6.7  11.5      9.2  12.0    11.7  13.6
                    Formula*     29.0  17.0    30.6  17.2     30.1  16.4    32.3  15.3

* Formula score = Number correct − (Number incorrect)/4. An additive
correction would obviously have been most appropriate for the group
promised a reward, but for the purpose of comparing groups the
subtractive correction is sufficient since, as is well known, when the
additive and subtractive corrections are applied to the same test
data, scores are obtained that are perfectly correlated.


The fourth score computed for each student was a formula
score consisting of the number of correct answers less one-fourth
the number of wrong answers. Intergroup differences in mean
formula scores were small.
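
The claim that the additive (reward) and subtractive (penalty) corrections yield perfectly correlated scores can be checked directly. The sketch below uses simulated answer patterns on a 90-item test, not the study's data; the function names and the random seed are hypothetical. Since incorrect = 90 − correct − omitted, the subtractive score equals 1.25 times the additive score minus 22.5, so the two are linearly related.

```python
import math
import random

N_ITEMS = 90  # items per form, as stated in the Method section


def subtractive_score(correct, incorrect):
    """Number correct minus one-fourth the number incorrect."""
    return correct - incorrect / 4


def additive_score(correct, omitted):
    """Number correct plus one-fifth the number omitted."""
    return correct + omitted / 5


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Simulated examinees (illustrative only): counts always sum to N_ITEMS.
rng = random.Random(1)
records = []
for _ in range(200):
    correct = rng.randint(0, N_ITEMS)
    incorrect = rng.randint(0, N_ITEMS - correct)
    records.append((correct, incorrect, N_ITEMS - correct - incorrect))

sub = [subtractive_score(c, w) for c, w, _ in records]
add = [additive_score(c, o) for c, _, o in records]
r = pearson_r(sub, add)
```

The same linearity explains why the formula means in Table 1 can be recovered from the other rows: for example, 37.0 − 31.9/4 is approximately 29.0.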


Statistical analysis of the four scores was done using four separate
three-way analyses of variance. The factors in the analyses were
instructions, test forms, and orders, where the "orders" factor simply 
marked the fact that half the examinees receiving each instruction 
took the test forms in the AB order while the other half took them 
in the BA order. There were repeated measures on one factor 
(test forms). All factors were interpreted as fixed. As a preliminary 
step to the four analyses of variance, tests were made to see 
whether the assumption of homogeneous variances could be re- 
jected (Winer, 1962, pp. 339-340). It could not. 

The analysis of variance results may be summarized as follows. 
The main effect of instructions was significant in three analyses: 
those of the number-correct scores (F = 3.86; df = 3,608; p <
.01), the number-incorrect scores (F = 13.32; df = 3,608; p <
.001), and the “omit” scores (F = 6.52; df = 3,608; p <
.001). Subsequent comparison of pairs of experimental group means
using the Newman-Keuls procedure indicated the following: For
the number-correct scores, none of the comparisons was significant
at the 5 per cent level despite the fact that significance was
attained in the analysis of variance. Such a result is possible (Winer,
1962; Hopkins and Chadbourn, 1967) simply because analysis of
variance uses the experiment as the base for establishing α, the
probability of an error of the first kind, whereas the Newman-Keuls
procedure employs the individual comparison as the unit for
setting α. It was nevertheless true that the conventional level of
significance was almost attained (p < .06) for four comparisons:
the mean number-correct scores for the group promised a reward
and the group threatened with a penalty were both lower than 
the means of the other two groups. Turning next to the number- 
incorrect scores, the mean for the group promised a reward was 
significantly lower than the means of all other groups (p < .01). 
Mean differences in number-correct scores among the groups 
threatened with a penalty, told to guess, and told nothing about 
guessing were not significant. For the omit scores, the differences 
between all pairs of group means were significant (p < .05). 

⁴ In order to achieve equal cell frequencies, the data for 51 students
were excluded on a random basis from the analyses of variance.

In addition to the main effect for instructions, there was one
other significant main effect in the analyses of variance: it was
for test forms and occurred in all four analyses. Form B of the
test was easier than form A as evidenced by the higher mean
number-correct and formula scores associated with Form B and, 
overall, the higher mean number-incorrect and omit scores as- 
sociated with form A. The only other significant F-ratios from the 
analyses of variance were for the interaction between test forms 
and orders. They occurred in the analyses of the number-incor- 
rect, omit and formula scores. On the face of it, this interaction 
might be interpreted as evidence of a significant practice effect. 
This is because (as described by Stanley, 1955) the design of this 
experiment confounds any practice effect with the test forms by 
orders interaction. However, an explanation of the interaction in 
terms of a practice effect is not adequate because there was not a
significant interaction term for the number of correct answers. 
This suggests that practice on one test form did not increase an 
examinee’s expected number of correct answers on the second form 
above what it would have been had he written it first. Moreover, 
inspection of the results in Table 1 indicates that for the test writ- 
ten second—independently of which form it is—there are fewer 
omitted questions and more wrong answers. It looks as if the extra 
questions attempted on the form of the test written second were 
all answered wrong, that is, at a worse-than-chance level. Perhaps 
the significant interactions were produced by testing fatigue 
which caused at least some students in all instructional conditions 
to become less inhibited about guessing and at the same time 
increasingly distracted by misinformation or attractive features of 
the wrong answers to a question.

One of the questions in the questionnaire asked the examinee to 
identify the type of instruction about guessing given in the test. 
Another question asked him to identify the information contained 
in the instructions about how the test would be scored. On the 
basis of responses to these two questions it was possible to identify 
and cull the examinees who did not remember the instructions as 
evidenced by the fact that they answered one or both questions 
incorrectly. It was found that 128 or 76 per cent of those promised 
a reward correctly remembered the instructions. Similar figures 
for the other experimental groups were as follows: threat of penalty, 
121 or 74%; guess, 105 or 61%; no instructions, 81 or 50%. (A 
χ² test of contingency revealed that the differences among the
experimental groups in the proportion of students correctly
remembering the instructions were significant: χ² = 31.39; df = 3;
p < .001).
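
The contingency test reported above can be reproduced from the counts given in the text. The sketch below is a plain Pearson chi-square computed by hand; the group sizes (169, 164, 172, 162) are those listed in Table 2, and the function name is hypothetical. It is offered as a check on the arithmetic, not as the authors' own computation.

```python
def chi_square_contingency(table):
    """Pearson chi-square statistic and degrees of freedom for a 2-D table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


# (number remembering the instructions, group size) for each instruction:
# promise of reward, threat of penalty, guess, no instructions.
counts = [(128, 169), (121, 164), (105, 172), (81, 162)]
table = [[remembered, n - remembered] for remembered, n in counts]

statistic, df = chi_square_contingency(table)
percent_remembering = [round(100 * rem / n) for rem, n in counts]
```

The statistic comes out close to the reported value with 3 degrees of freedom, and the rounded percentages match those quoted in the paragraph.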

A comparison was made of the test results for the examinees 
who correctly recalled the test instructions and the results for the 
total group. Minor differences were observed. However, it is cor- 
rect to say that the same pattern of inter-experimental group dif- 
ferences was obtained from the data of the reduced sample as from 
the data of the total sample. 


Effect of Instructions on Reliability 


As an estimate of reliability, the interform correlation⁵ was
computed for each of two scores, the number-correct and the formula
scores. The correlations were computed in the following way: 
each experimental group was divided into subgroups on the 
basis of the order in which the forms were written. An interform 
correlation was computed separately for each subgroup of each 
experimental group and then the two resulting correlations were 
averaged. (The Fisher z transformation was employed in calculat- 
ing the averages.) 
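
The averaging step can be sketched as follows. The text does not say whether the two subgroup correlations were weighted, so this minimal example assumes a simple unweighted mean of the Fisher z values; the input correlations are made-up illustrative numbers, since the subgroup values are not reported.

```python
import math


def average_correlations(correlations):
    """Average correlations via the Fisher z transformation.

    Each r is mapped to z = atanh(r), the z values are averaged
    (unweighted, an assumption), and the mean is mapped back with tanh.
    """
    z_values = [math.atanh(r) for r in correlations]
    mean_z = sum(z_values) / len(z_values)
    return math.tanh(mean_z)


# Hypothetical subgroup correlations for the AB and BA orders.
example = average_correlations([0.90, 0.94])
```

Averaging in the z metric gives a result slightly above the arithmetic mean of the two correlations, because the transformation stretches values near 1.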

The interform correlations are reported in Table 2. It is clear 
that for both number-correct scores and formula scores the highest 
correlation was observed in the group told to guess when in doubt. 
Only slightly less was the correlation for the group promised a 
reward. The two remaining groups had still smaller correlations 
with the smallest for the group threatened with a penalty. 

A conventional test of significance was made of the difference 
between pairs of correlations (McNemar, 1961). Of the correla- 
tions based on number correct scores, the one for the group told 


5 This correlation provides the classical estimate of reliability given parallel
test forms and scores for a population of examinees (Gulliksen, 1950). In the
[...] that the tests are not parallel (the analysis [...] difficulty; also,
inspection of Table 1 reveals that form A had a consistently higher standard
devi[...] cases the difference was small.) Moreover, [...] too small for the
group statistics to [...] population parameters. Thus the [...] “best” estimates
of reliability [...] may be expected to produce [...] linearly predictable from
scores on the other form. The interform correlation is a useful measure
of linear predictability and in some sense can be regarded as a measure of
reliability.
TABLE 2

Interform Correlations for Number-Correct and Formula Scores for Each
of Four Different Types of Guessing Instructions*

                                            Type of Score
Type of Instruction      N            Number-Correct    Formula

Promise of Reward        169 (128)    .924 (.941)       .922 (.939)
Threat of Penalty        164 (121)    .888 (.879)       .894 (.884)
Guess                    172 (105)    .932 (.939)       .933 (.937)
No instructions          162 (81)     .896 (.904)       .904 (.913)

* Numbers in brackets are based on the students who in the posttest questionnaire indicated
that they correctly remembered the instructions about guessing given in the test.



to guess was significantly higher (z > 1.97; p < .05) than the 
correlations for the group threatened with a penalty and the one 
told nothing about guessing. None of the differences between 
other pairs of correlations achieved the conventional level of sig- 
nificance, although the difference between the correlation for the 
group promised a reward and the one threatened with a penalty 
approached significance (z = 1.84, p < .07). Differences between 
interform correlations based on formula scores were generally 
smaller. A difference significant by conventional standards was
observed only between the correlations for the group told to
guess and the group threatened with a penalty (z = 2.17; p <
.05).
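
A conventional test of the difference between two independent correlations can be sketched as below. This is a sketch under stated assumptions, not a reproduction of the authors' procedure: it uses the number-correct correlations and total group sizes from Table 2 as if each correlation came from a single sample, whereas the published values are averages over two subgroups. The z it yields therefore need not match the values quoted in the text.

```python
import math


def compare_independent_correlations(r1, n1, r2, n2):
    """z test for the difference between two independent correlations."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / standard_error
    # Two-tailed probability under the standard normal distribution.
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_two_tailed


# Number-correct scores, Table 2: group told to guess vs. group threatened
# with a penalty.
z, p = compare_independent_correlations(0.932, 172, 0.888, 164)
print(round(z, 2), round(p, 3))
```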

Slightly different results were obtained when reliabilities were 
estimated from the data of only those students who gave evi- 
dence on the questionnaire of correctly remembering the instruc- 
tions to the test. These estimates of reliability are reported in 
brackets in Table 2. In this case, the reliability estimate is highest 
for the group promised a reward although the correlation for the 
group told to guess is only slightly less. Next in order of size 
come the estimates for the group given no instructions about 
guessing and the group threatened with a penalty. For both number-correct
and formula scores the reliabilities for the group promised
a reward and the group told to guess were significantly
higher than the reliability of the group threatened with a penalty
(z > 2.54, p < .02). The differences between all other pairs of
correlations were not significant.


Additional Questionnaire Results


It will be recalled that there were significant differences among
the experimental groups in the proportion of examinees in each
group who demonstrated by their responses to the first two
questionnaire items that they correctly remembered the test instructions
about guessing and about how the test would be scored. There
were significant differences in the pattern of group responses to
other questionnaire items. The items and the percentage of
responses in each category of the items for each experimental
group are reported in Table 3. It should be noted that the results
in Table 3 are based on the responses of all examinees. Results for
only those examinees who correctly recalled the test instructions
are not reported because they are very similar to the results for
the total sample.

There were three questions in the questionnaire which failed
to differentiate among the experimental groups. These were: “How
well do you think your score on the vocabulary test will represent
your true knowledge of word meanings?” (with responses varying
on a five category scale from very well to very poorly),
“What kind of test do you like best?” (with response alternatives
of essay test, multiple-choice test and no preference) and “What
kind of test do you think gives the best measure of your true
knowledge of a subject?” (with response alternatives of essay test,
multiple-choice test, it depends on the subject matter being tested
and no preference).

On the basis of these results it can be concluded that the
response patterns of the experimental groups to the items on the
questionnaire were what might reasonably be expected. Apparently
test-taking strategies were affected differentially by the different
guessing instructions; more students in the group told to guess
actually reported making random guesses than students in the
groups threatened with a penalty or promised a reward. Moreover,
judging from the questionnaire responses of the students, it
is clear that the instructions promising a reward were more
effective than the instructions threatening a penalty in producing
behavior deviating from the norm provided by the responses of the
group given no instructions about guessing. Certainly a
reward was a more salient instruction than all the others in that
TABLE 3

Percentage of Responses in Each Experimental Group to Questionnaire Items
for Which Inter-Group Response Patterns Differed Significantly

                                              Experimental Group
                                     R(a)        P(b)        G(c)        N(d)
Question                           (N = 169)   (N = 164)   (N = 172)   (N = 162)

3. What did you decide about guessing before starting to work the test?
   (a) Guess when in doubt.            23          48          75          55
   (b) Not to guess when in doubt.     20          12
   (c) No decision.                    57          40
   χ² = 78.15 with 6 df

4. Was your decision about guessing affected by the instructions given in
   the examination?
   (a) Yes.                            48          49          44          14
   (b) No.                             33          4           59
   (c) Do not know.                    20          18
   χ² = 53.12 with 6 df

5. What did you actually do when you came across an unfamiliar word in
   the test?
   (a) Always guessed.                 6           32          56          26
   (b) Guessed only if 1, 2 or 3
       choices could be eliminated
       as wrong.                       86          66          42          69
   (c) Always omitted the question.    8           2           2           5
   χ² = 67.31 with 6 df

7. Do you think you can raise your score on a vocabulary test by guessing
   when you do not know the answer?
   (a) Yes.                            22          38          50          48
   (b) No.                             47          35          24          31
   (c) Not sure.                       27          26          21
   χ² = 283 with 6 df

(a) Group promised a reward for omitted items.
(b) Group threatened with a penalty for wrong answers.
(c) Group told to guess.
(d) Group given no instructions about guessing.


it resulted in the highest proportion of students who could
correctly answer questions about the content of the instructions. Also
important to note is that the groups did not differ on questions of
general opinion about test preferences. These results suggest that
the questionnaire was a valid measuring instrument; it discriminated
among the experimental groups on issues where differences
might reasonably be expected and it failed to discriminate where
differences were unlikely to be produced by the test instructions.


Discussion

The results of this investigation indicate that promising examinees
a reward for omitting questions they cannot answer is a more
effective means of eliciting omissive behavior than threatening
examinees with a penalty for wrong answers. The results also
indicate that promise of reward decreases the number of wrong
answers but seems to produce no difference in the number of
correct answers. Equally worthy of note is that promise of reward
resulted in significantly more reliable test scores than threat of
penalty, at least for those students who, in the posttest situation,
correctly remembered the test instructions.

The practical implications of these findings are obvious: If the
test constructor desires to maximize the amount of omissive behavior,
with the probably correlated result of reducing the amount
of guessing, the present investigation suggests that it is more
effective to use instructions which promise a small reward for
omitting questions that cannot be answered, rather than to use
instructions that threaten a penalty for incorrect answers.

In addition to the aforementioned practical implications, the
results of the present investigation point up again the well recognized
inadequacies of the simple guessing model, that is, the model
which assumes that if a question is known it is answered correctly,
otherwise it is omitted or the answer is guessed at random among
all the alternatives given. One way the inadequacy of this model
may be seen is that groups which omitted fewer items did not get
precisely the number of additional correct and incorrect answers
that the model would predict. For example, the group threatened
with a penalty attempted eight more questions per test form on
the average than the group promised a reward, but they did not
average 1.6 more questions correct; rather they averaged nearly
zero more questions correct and eight more questions incorrect.
Other evidence that the simple guessing model is inadequate is
that the group told to guess achieved scores that were as reliable
as the scores of the group promised a reward. Yet according to the
simple guessing model, it should have turned out that the group
told to guess would have less reliable scores because their scores
would be expected to contain more random variance due to guessing.
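
The prediction of the simple guessing model in the example above can be worked through in a few lines. This is a sketch under one assumption not stated outright in this section: that each question offered five alternatives, which is inferred from the 1.6 figure (8 / 5) and from the questionnaire option about eliminating "1, 2 or 3 choices." The function name is hypothetical.

```python
def simple_guessing_prediction(extra_attempts, n_alternatives):
    """Expected extra right and wrong answers if every extra attempt is a
    blind guess among equally likely alternatives."""
    expected_correct = extra_attempts / n_alternatives
    expected_incorrect = extra_attempts - expected_correct
    return expected_correct, expected_incorrect


# Eight additional questions attempted per form; five alternatives assumed.
predicted_correct, predicted_incorrect = simple_guessing_prediction(8, 5)
print(predicted_correct, predicted_incorrect)
```

Under that assumption the model predicts 1.6 additional correct and 6.4 additional incorrect answers, whereas the text reports roughly zero and eight.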

A more appropriate model is one which asserts that examinees do 
not guess at random, but rather they guess after using partial in- 
formation to rule out one or more of the alternatives as incorrect. 
Such a model almost certainly comes closer to providing a de- 
scription of multiple-choice test behavior. But the partial informa- 
tion model is also inadequate to explain all the intergroup differences 
observed in the present investigation. How, for example, can partial
information be used to account for the relatively high reliability
of the group instructed to guess when the results of the
group given no instructions about guessing are considered? The
group given no instructions differed very little from the group 
told to guess in the number of correct and incorrect answers and 
the number of omitted questions, indicating that the group given 
no instructions responded with almost the same level of partial 
information, yet it had a much lower reliability. 

The problem of developing a model of test behavior to explain
the obtained results is unsolved. It seems likely, however, that an
adequate model would have to take account of some personality
variable such as propensity to guess or propensity to omit and its
interactions with level of partial knowledge and type of test
instructions.


Conclusion 

The present investigation provides evidence supporting the conclusion
that promise of reward for omitted items is a more effective
method for getting examinees to omit than is the threat of a penalty
for wrong answers. An associated conclusion that holds at least for
examinees who correctly recall the test instructions after the test is
completed is that scores under promise of reward are more reliable
than scores under threat of penalty. These conclusions are
of course limited to the vocabulary tests, the instructions and
student population used in this investigation.
PROVISION FOR PUBLICATION OF VALIDITY
STUDIES OF ACADEMIC ACHIEVEMENT 


Early in the life of this journal it became evident that the pre- 
diction of academic achievement is by far the most popular area 
of research in the measurement field. It also became apparent that 
unless heroic measures were taken, the journal might easily be 
practically monopolized by this subject. The heroic measure re- 
sorted to for a while was simply not to publish any studies on the 
prediction of academic achievement. 

In the course of time, it became evident that the solution hit 
upon was too drastic. After all, it is important that validity reports 
be available, at least in condensed form, to educational and per- 
sonnel psychologists and to school counselors who wish to evaluate 
the relative merits of the various instruments available for the
prediction of academic achievement. Furthermore, it appears that 
a substantial amount of validity data cannot be conveniently com- 
municated to professional workers in the field of measurement unless 
provision is made for publication in a professional journal. 

In the light of this situation, the policy has been adopted of 
publishing a section devoted to such studies in the form of extra 
pages for which the authors bear most of the publication costs. 
This policy allows publication of the usual number of pages on 
other subjects in the measurement field. The charges consist of 
thirty dollars per page of running text plus any extra costs which 
may be involved in the composition of tables, figures, and formulas.
Authors are furnished one hundred off-prints without extra charge.

Preference will be shown for manuscripts of fewer than 1200
words, with no more than six references and containing two or
fewer tables each of no more than one 8 1/2" x 11" elite typed
page, making six printed pages. Any manuscript exceeding 3000
words, 12 references, and four tables or figures equivalent to three
8 1/2" x 11" typed pages will be automatically returned, as 12
printed pages will be the maximum total number of pages for any
article to be published in this section.

The Validity Studies of Academic Achievement Section is published
twice a year, once in the Summer issue and again in the
Winter issue, for which the closing dates for receiving manuscripts
are November 30th and May 30th, respectively.

Two copies of the manuscripts should be sent to:


Dr. William B. Michael
325 Callita Place
San Marino, California 91108.

WHAT THE ITPA MEASURES: A SYNTHESIS OF
FACTOR STUDIES OF THE 1961 EDITION¹


C. E. MEYERS 
University of Southern California 


The Illinois Test of Psycholinguistic Abilities (McCarthy and
Kirk, 1963) is perhaps the first successful break from the Binet- 
Wechsler tradition in the psychometric appraisal of young chil- 
dren. Designed by its authors to describe the child’s functioning in 
order to guide remedial steps, rather than to give him a single classi- 
ficatory score, the new instrument has had a considerable impact 
on the conceptualization of disabled school learners and on their 
programming. While the instrument is a tool for a profile description
of a child, it happens also to be frankly based on a differen-
tiated rather than a general concept of human intellect. Do the 
nine subtests each measure an independent ability? Or is there evi- 
dence that they separately sample the functions listed in psycho- 
linguistic theory (levels of representation, processes, and channels 
of communication)? The factor analytic approach is one means of 
determining an answer. There have been (to the present writer's 
knowledge) 16 different factor studies employing some or all parts 
of this scale. Those studies judged of value to answer the basic 
question were compared for this synthesis. One is an original analy- 
sis on data from the McCarthy and Olson (1964) validation re- 
port, and another is a re-analysis of the Haring and Ridgway 
(1967) report from a correlation matrix supplied by Dr. Haring.

The several factor studies have been divided into three varieties
according to their utility for this synthesis. They are listed in Table 
1. Type A are the seven analyses of extensive intercorrelation 
matrices in which a considerable number of other ability variables 
were combined with the nine of the ITPA. These provide the back- 
bone of this report. Of the seven, Horner's (1967) could not be 
utilized in the final synthesis shown in Table 2, though it contributed
to the report in other respects. Horner's study used only 50
cases with an extreme range of IQ and yielded, among four potentially
useful factors, two that were bipolar and hence impossible
to interpret in terms of abilities, which logically cannot
have negative values. Type B includes four studies in which
one or more ITPA variables were employed in factor seeking ex- 
plorations. All these could be used, though their results naturally 
are limited to the ITPA subtests employed. The type C studies 
factored either only the ITPA variables, or these with few others. 
They were mostly unusable because factors based on the limited 
matrix tend to mislead, as is the case with analyses based on WISC 
or WAIS subtests alone. Of the five type C studies, only the report of
Quereshi (1967) was employed; it added only the Binet MA to the
matrix, but it had such numbers and was otherwise of such
sophistication as to warrant inclusion. The remaining analyses of the C
category were not employed (they are listed for the possible use of
other investigators). As these studies factored a very limited matrix
and were hence barren of the reference tests which might have
brought out latent structure in the ITPA, they gave rise to an
excessive “general factor,” one of the results usually found in limited
matrices. As none of the results of these analyses, except for the
general factor, was in contradiction with the pattern developed in the
synthesis based on the A and B types, it was decided to ignore them
for the present purpose.

The technical quality of the factor investigations which were
used had to be taken for granted in most instances, because of a
paucity of descriptive information. Such probably did not matter,
for the eleven studies employed had observed the logical criteria
of positive manifold and simple structure. The eleven included two
analyses with factor analyses performed by the present investigator.
For these, the procedure was a computerized principal components
extraction of factors which were then rotated to the varimax
criterion (Kaiser, 1958).

TABLE 1

List of Factor Analytic Studies Involving the ITPA,
Arranged by Types A, B, and C

Code*   Study

        Type A Studies, ITPA with Several Other Variables

C       Center (1963). A 22 variable matrix, including PMA. 48 normal
        subjects, 8-9 years. One purpose was to determine ITPA factor
        structure.

        Haring and Ridgway (1967). Present investigator (C. E. M.)
        refactored the data from intercorrelation matrix supplied by authors.
        Reduced matrix to 28 variables to avoid spurious factors due to
        ITPA total, etc. 106 Kg children screened as risks for learning
        disabilities.

        Horner (1967). With six Parsons Language tests, 15 variables,
        50 retarded subjects with wide IQ range.

        Loeffler (1965). Battery of 32 tests hypothesizing certain factors
        in 100 retarded children at 6.5 years MA.

        McCarthy and Olson (1964). Intercorrelation data based on the
        validity study. Based on 87 “linguistically normal subjects” CA
        7-81 years. (See Meyers below.)

        Meyers. Unpublished factor study based on 29 of the 51 variables
        of matrix given by McCarthy and Olson, 1964. Several analyses
        required, first to identify spurious factors due to re-use of data,
        then to achieve parsimonious result.

        Ryckman (1966). Matrix of 19 variables (one was CA) on 100 Negro
        Kg children. (Data to be distinguished from Ryckman’s listed under C
        below.)

        Strong (1964). Used 23 variable matrix with 200 institutionalized
        MR subjects with wide IQ range.

        Type B Studies, Using Selected ITPA Subtests with Many Other
        Tests to Implement Ability Factor Hypotheses

        Carlson and Meyers (1968). ITPA Auditory-vocal sequential with
        two other auditory forward tests determined a clear auditory memory
        factor distinct from visual memory. Battery of 18 tests, 80 retarded
        subjects of MA 4.

        McCartin and Meyers (1966). This 23-test study employed
        Auditory-vocal association and Vocal encoding each with
        analogous tests to identify semantic ability factors. Each loaded
        on factors as hypothesized, Auditory-vocal association with
        convergent semantic units, Vocal encoding with divergent production
        of semantic units. 100 normal children of 6 years CA.

        Meyers, Sitkei, and Watts. (Unpublished data.) To help establish
        auditory memory factors, ITPA Auditory-vocal sequential was
        entered with five other memory tests, determining an auditory
        memory factor. 20 tests given to 146 normal [...].

        Sitkei and Meyers (1908). Employed ITPA Vocal encoding and
        Auditory-vocal sequential in 22 test battery given 109 normal
        children of CA 4, to represent divergent semantics and memory
        for symbolic systems, respectively, among other factors.

        Type C Studies, Factoring Only the ITPA Variables, or Maybe One
        or Two Others

        Levanthal and Stedman (1967).

        McCarthy and Kirk (1963).

Q       Quereshi (1967). Analysis at several age levels of 14 variable
        matrices, of which 10th was Binet MA and others were inferred factors
        based on intercorrelation inspection and theory.

        Ryckman (1966). Presents in Appendix a restudy of McCarthy
        and Kirk (1963) and Semmel and Mueller (1965) reports.

        Semmel and Mueller (1962).

* This code used in Table 2.

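
For readers unfamiliar with the varimax criterion, the rotation step can be sketched for the simplest case of two factors, where the maximizing angle has a closed form. This is an illustration only and not the computer program used for the analyses above: the loadings are hypothetical, the extraction step is not shown, and the raw criterion is used without the row normalization that is often applied. With more than two factors, the same planar rotation is usually applied to each pair of factors in turn until the criterion stops improving.

```python
import math


def varimax_criterion(loadings):
    """Raw varimax criterion: for each factor, the variance of the squared
    loadings over variables, summed over factors."""
    p = len(loadings)
    total = 0.0
    for j in range(len(loadings[0])):
        squares = [row[j] ** 2 for row in loadings]
        mean_square = sum(squares) / p
        total += sum(s * s for s in squares) / p - mean_square ** 2
    return total


def varimax_angle(loadings):
    """Angle that maximizes the raw varimax criterion for two factors."""
    p = len(loadings)
    u = [x * x - y * y for x, y in loadings]
    v = [2.0 * x * y for x, y in loadings]
    a, b = sum(u), sum(v)
    c = sum(ui * ui - vi * vi for ui, vi in zip(u, v))
    d = 2.0 * sum(ui * vi for ui, vi in zip(u, v))
    return math.atan2(d - 2.0 * a * b / p, c - (a * a - b * b) / p) / 4.0


def rotate(loadings, angle):
    """Orthogonal rotation of the two reference axes through `angle`."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return [(x * cos_a + y * sin_a, -x * sin_a + y * cos_a) for x, y in loadings]


# Hypothetical unrotated loadings of six variables on two factors.
unrotated = [(0.70, 0.50), (0.65, 0.45), (0.60, 0.55),
             (0.60, -0.50), (0.55, -0.45), (0.65, -0.55)]
angle = varimax_angle(unrotated)
rotated = rotate(unrotated, angle)
print(round(varimax_criterion(unrotated), 4), round(varimax_criterion(rotated), 4))
```

The rotation leaves each variable's communality unchanged; only the distribution of squared loadings across the two factors changes.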

Results. Table 2 is a graphic arrangement of the factors considered
to be well determined from the appraisal of the 11 usable
studies, and which have representation otherwise in the literature of
factor investigations at this age level. The table shows letters in
cells. Each cell is the intersection of the ITPA subtests with the
proposed factors, which are in columns. Taking column 1 for
illustration, the letter C appears in three cells. C stands for Center's
report, listed in the references. Center, among his factors, identified
one which loaded on ITPA subtests 1, auditory-vocal automatic,
4, auditory-vocal association, and 7, auditory-vocal sequential.
Other letters stand for other studies, as coded in Table 1.

The basis for placing a factor from another report in the same
column with Center's factor was whether that factor saturated
about the same pattern of subtests of the ITPA. Another criterion
was whether associated reference tests clued the interpretation to
the same location. Another criterion was negative in that, if a study
produced a factor worthy of consideration (that is, it was not a
singlet or a doublet without support of other tests), it would fit no
other column. In all instances, with occasional problems between
factors II and III, there was no other column in which to place an
obtained factor. It was on this basis that Table 2 was constructed,
and by and large the pieces fell congruently and rather neatly into
place. Only saturations of .40 or greater were used, though saturations
of lesser value sometimes helped verify placement.
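
The first placement criterion, matching factors by the pattern of subtests they saturate at .40 or above, can be sketched as follows. Everything numerical here is hypothetical: the loadings are invented for illustration (chosen only so that the first factor is salient on subtests 1, 4, and 7, as described for Center's factor), and the overlap index is an assumption of this sketch rather than a statistic used in the synthesis, where placement was a matter of judgment supported by reference tests.

```python
def salient_subtests(loadings, threshold=0.40):
    """Set of subtest numbers whose loading reaches the threshold."""
    return {subtest for subtest, value in loadings.items() if value >= threshold}


def pattern_overlap(first, second):
    """Share of subtests common to both salient sets (intersection over union)."""
    union = first | second
    return len(first & second) / len(union) if union else 0.0


# Hypothetical loadings of the nine ITPA subtests on one factor from each of
# two studies (keys are subtest numbers 1 to 9).
study_a_factor = {1: 0.62, 2: 0.10, 3: 0.05, 4: 0.58, 5: 0.12,
                  6: 0.31, 7: 0.45, 8: 0.20, 9: 0.33}
study_b_factor = {1: 0.55, 2: 0.18, 3: 0.09, 4: 0.66, 5: 0.07,
                  6: 0.22, 7: 0.28, 8: 0.15, 9: 0.41}

salient_a = salient_subtests(study_a_factor)
salient_b = salient_subtests(study_b_factor)
print(sorted(salient_a), sorted(salient_b), pattern_overlap(salient_a, salient_b))
```

Two factors sharing subtests 1 and 4 would be candidates for the same column; the loadings below .40 (for example, subtest 7 in the second study) are the "lesser value" saturations that can help confirm a placement.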


Judging from this compilation, the ITPA appears to tap six
separate and established abilities, and possibly a seventh (factor
VII being in doubt). To say what these factors are, the investigator
TABLE 2
Established Factors Represented in the ITPA Subtests


ITPA Subtest                        I    II    III    IV    V    VI    VII

1. Auditory-vocal automatic         C, L, Mo, Q, R, S      H
2. Visual decoding                  Q, R, S      C, H      L, R
3. Motor encoding                   C, H, Q, S, Mo
4. Auditory-vocal association       C, H, L, Q, R, S, mm      Mo
5. Visual-motor sequential          R      L, Q, R, Mo      C, H
6. Vocal encoding                   R      C, H, Q      L, S, mm, ms, sm
7. Auditory-vocal sequential        C, Q, R      H, L, Mo, cm, ms
8. Visual-motor association         S      C, Mo, Q
9. Auditory decoding                L, R, S      Q      C, H, Mo

Proposed structure of
intellect designation               C-NM  NMR  CMR  MFS  MMS  DS  CMS


Note. Only loadings of .40 or more are indicated, provided there was support by reference variables.
Letters in columns refer to coded studies in Table 1. Factor VII is tentative. For factor [...]
names, see text.



has drawn upon the considerable body of work establishing separate
ability factors at these age levels, most of it done without reference
to the ITPA. The common or descriptive names as well as the
Guilford structure-of-intellect (SOI) lexical names are used. The
results are presented also in Table 3.

Factor I saturates test 1, auditory-vocal automatic, and 4,
auditory-vocal association, in all cited studies but one; it variously
saturates 7, auditory-vocal sequencing, 6, vocal encoding, and/or
9, auditory decoding as well. In those studies having smaller matrices
it is called a “general language factor,” but in larger matrices
the word “general” may not be deserved, for only test 1,
auditory-vocal automatic, and 4, auditory-vocal association, are
commonly saturated. The reference tests confirm that this obviously
is the well-known verbal comprehension ability. Because it
involves both cognition of language and inference in [...], it
carries a hyphenated SOI label of cognitive-convergent semantics
(C-NM). Difficulty in factorially separating cognitive and convergent
language abilities has been common at the age level. Test
1, auditory-vocal automatic, strictly speaking, requires a symbolic
TABLE 3
Summary of Relations between ITPA Subtests and Identified Ability Factors


ITPA      Factor                                                               Nature of Tests and
Subtest   Number   Common Name              Guilford SOI Name                  Reference Tests

1, 4*     I        Verbal comprehension     C-NM, cognitive-convergent         Picture vocabularies,
                                            semantics                          reading, other verbal
                                                                               tests
2, 3, 6   II       Vocal-motor              NMR, convergent production         Spatial inference tests
                   expression               of semantic relations
2, 5, 8   III      Meaningful figural       CMR, cognition of                  Meaningful figural-
                   comprehension            semantic relations                 pictorial relations
5         IV       Immediate visual         MFS, memory for                    Visual span-type tests
                   memory                   figural systems
6         VI       Vocal expression         DS, divergent                      Vocal description,
                                            semantics                          examples, elaboration
7         V        Immediate auditory       MSS, memory for                    Auditory span-type
                   memory                   symbolic systems                   tests: letters, words,
                                                                               digits
9         VII      Vocal decoding,          CMS, cognition of                  Discrimination of
                   receptive language       semantic systems                   meaning of language
                                                                               of others


* Also tests 6, 7, 9 in some studies.



rather than a semantic function, or perhaps involves both in small 
children. The studies cited here, Ryckman’s excepted, were not 
capable of separating semantic from symbolic content. But Ryck- 
man’s reference variables were not univocal enough to permit a 
separate factor identification for test 1. 

Factors II and III are chiefly visual motor. Factor II involves an
encoding or expressional process. Test 6, vocal encoding, appears in this
factor, and while it is auditory-vocal, it is also motor-expressive. 
“Vocal motor expression” is a sufficient common name. Test 2, visual 
decoding and 3, motor encoding, are the two visual tests loading 
on Factor II. Both employ meaningful figural material but do not 
require perceptual discrimination as such, so they are nonverbal 
semantic rather than figural. Such tests in Guilford’s system almost 
always load with semantic tests using language. Thus II can be 
labeled convergent production of semantic relations, or NMR. 
Factor III saturates tests using meaningful figural material (test
2, visual decoding and 8, visual-motor association, not to mention
5, visual-motor sequential). The first two tests require comprehension
of the meaning of the material used and relations among
them; the factor is given the Guilford label of cognition of semantic
relations, CMR.

Factor IV is consistently identified by four competent studies. It 
saturates test 5, visual-motor sequential, a visual motor test, and 
its nature is confirmed by the important reference tests. One might 
note that test 5 loads separately from the auditory memory test, 7, 
auditory-vocal sequential. Such clear segregation of auditory and 
visual memory has been found in several non-ITPA studies (e.g., 
Meeker and Meyers, 1969; Orpet and Meyers, 1966). The Struc- 
ture of Intellect name is memory for figural systems (visual), MFS. 

Factors V and VI, though each saturates but one ITPA subtest, 
are well substantiated in other literature and are undeniable here. 
Factor V, saturating 7, auditory-vocal sequential, does so in all five 
studies which employ other auditory memory tests. There is no 
doubt that it is Guilford’s memory for symbolic systems (auditory),
MMS. Factor VI saturates test 6, vocal encoding. In all five of the 
matrices which provided for other divergent semantic activity, a 
factor is formed, identifiable as divergent semantics, DS. 

Factor VII appears in three of the soundest studies. Not well 
supported in interpretation by reference tests, its repeated appear- 
ance cannot be entirely denied. Saturating only test 9, auditory 
decoding, it appears to be a simple cognition of semantic systems, 
CMS.

Discussion. The fact that the ITPA seems to reflect these six or
seven abilities does not mean it measures them all well. At least
three (V, VI, and VII) are underdetermined by the ITPA, and are
brought forth in factor analyses only with the support of reference
tests. There is not enough test reliability in tests 6, 7, or 8, each
by itself, to give a usable score in the quality it samples. But the
power to do so is latently there, waiting to be developed either with
a more reliable test or with another subtest. Similarly, if it were
desired to bring out a factor representing the child's developed
ability in use of grammatic rules (the potential symbolic factor
represented by subtest 1, auditory-vocal automatic) such will require
more than this one subtest. Otherwise, the ITPA appears to
measure separately with some reliability the verbal comprehension
factor I, also called cognitive-convergent semantics, and two visual
motor factors, II and III.

The results should also be interpreted in terms of the intended
categories of functions which the ITPA presumes to sample. These
are as follows: (a) Channels of communication (auditory-vocal 
and visual-motor), (b) Levels of organization (representation, as 
in word knowledge exhibited in the decoding, encoding, and asso- 
ciation subtests; and automatic-sequential, as in grasp of rules for 
plurals, and ability to observe and retain a symbol sequence, or so- 
called immediate memory), and (c) Processes (encoding, decod- 
ing, and association). The question is, were the obtained factors 
representative of any of the above structure? It is obvious that 
channel separation took place. Factor I saturated auditory vocal 
tests, Factors II and III saturated visual motor tests (but II also 
saturated vocal encoding, test 6), while III seems purely visual. 
Factor IV is a purely visual memory ability, and factors V, VI, and 
VII are limited to single channels. Hence channel separation occurred.
It is necessary to make an aside on the nature of factor I,
verbal comprehension or cognitive-convergent semantics. The 
ITPA does not measure comprehension of language through the 
reading process, which would be through the visual channel, that is, 
a feat of literacy. The subjects for whom the ITPA is intended to 
be used are of course preliterate. 

As to the representational level of organization, this is supposed 
to be sampled in any tests where word meaning is required, and 
hence is definitionally equivalent to “semantic” or “verbal comprehension.”
Factor I shows the function, even though one might
have doubts about the nature of tests 1 and 6, which load on it. 
Does this representational level segregate itself from the automatic- 
sequential level? Tests 1, 5, and 7 are designated by the authors as 
reflecting the latter. The experienced psychometrist and factorist 
will instantly detect that automatic and sequential or short-term
memory processes are psychologically different from each other.
The three tests are to 


To sum the above, "representational level” is found, and merely 
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represents other words for verbal comprehension. “Automatic” is 
lost, and “sequential” divorces from automatic, being merely short- 
term memory, which separates by channel into two abilities. 

Finally, are the factors related to the process differences? There 
are two decoding tests, 2 and 9. These, however, do not co-load; 
each goes by way of its auditory-vocal or visual-motor orientation, 
not by the decoding process. The two encoding tests, 3 and 6, also 
fail to associate factorially, 6 however becoming a divergent factor 
sui generis (factor VI). The association tests are 4 and 8. Again, 
their factor saturations are determined otherwise than by asso- 
ciation. The process orientation of psycholinguistics, thus, is com- 
pletely unsubstantiated in these factorial results. 

If one reads across the table rather than up and down, he will 
observe that most of the subtests appear to load on more than one 
factor. There is no law requiring a subtest to be univocal. It is 
possible that second order factors, representing broader domains 
such as auditory vs. visual or according to levels as in the original 
theory, might be identified in a proper study. The present informa- 
tion is not capable of generating such second order rubrics. 
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THE VALIDITY OF DIFFERENT STRATEGIES OF 
SCALE CONSTRUCTION IN PREDICTING 
ACADEMIC ACHIEVEMENT¹


HUBERT J. M. HERMANS²
University of Nijmegen, The Netherlands 


The aim of this study was to compare different strategies of
scale construction using a sample of children of elementary school 
age. The criterion variable in this study was the grades of the stu- 
dents. The relevant predictor variable was the achievement mo- 
tive. Three other variables, namely debilitating anxiety, facilitat- 
ing anxiety, and social desirability were included to investigate 
the discriminant validity of the achievement items. All these vari- 
ables, except social desirability, have shown some degree of cor- 
relation with various criteria of study success in previous empirical 
investigations (e.g., McClelland, Atkinson, Clark, and Lowell, 1953; 
Alpert and Haber, 1960). 

In earlier investigations (Hermans, 1967), the TAT procedure of 
McClelland and a multiple-choice questionnaire were compared in 
validity against performance criteria, in laboratory and field stud- 
ies. In both situations the questionnaire method was more valid 
than the TAT (Hermans, 1967). Consequently, the questionnaire 
method was chosen in this study for measuring the achievement 
motive (AM), facilitating anxiety (FA), and debilitating anxiety 
(DA). To control the influence of stylistic variance, a social
desirability scale (SD) was included in the questionnaire. All four


1 Pa International Congress of Applied Psychology, 
NITRE Ad programs for these analyses were Miu 
by Dr. E. F. Ch. I. Roskam and H. E M. Borgers of the Department of
Mathematical and Statistical Psychology at the University of Nijmegen.

2 The author wishes to thank Dr. Lewis R. Goldberg and Dr. Frank D.
Payne for their valuable advice and help with the translation. 
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scales were composed of multiple-choice items. Previous experience 
with these four variables with adult subjects has shown that of the 
four scales, the one measuring the achievement motive is the most 
valid predictor of school grades; the anxiety variables give lower 
correlations, and SD rarely shows a correlation with grades (Her- 
mans, 1967). 

Procedure. A questionnaire with 94 items was given to the 5th 
and 6th grade classes in two elementary schools in The Nether- 
lands. The questionnaire contained 35 items for the achievement 
motive, 14 items for debilitating anxiety, 17 items for facilitating 
anxiety, and 28 for SD. These items were drafted by the author 
and were based on a review of the literature and empirical experience 
with adult Ss. The review of previous findings was used to guarantee 
the substantive validity of the initial item pool (Loevinger, 1957). 

Strategies. The following strategies for scale construction were 
used: 

The intuitive scales: The initial item pools for the achievement
motive, debilitating anxiety, and facilitating anxiety were used in
their original form without any kind of internal consistency analy- 
sis. The scale score was an unweighted summation of item scores. 
The item score was a one or zero, split as nearly as possible to the 
median, based upon the response frequencies to each of the mul- 
tiple-choice alternatives. The scales constructed in this way were 
called intuitive scales because judgments regarding the suitability 
of an item for inclusion in the scale relied solely on the cognition of 
the test developer (see Hase and Goldberg, 1967). 
The intuitive-internal scales: Each of the items in the initial
three item pools was correlated with its respective scale scores, and
items correlating .20 or greater with their respective total scores were
retained to form the three intuitive-internal scales. Only those
items which were in the corresponding original intuitive scale
were selected, and thus the three internal scales, like the three
intuitive scales, contain no overlapping items.

The intuitive-internal-discriminant scales: The aim of the discriminant
strategy was to make the scales as independent of each
other as possible. Each of the items from the internal scales was
correlated with the scores from each of the other (internal) scales,
and items with correlations of .25 or less with all of the other three
scales were retained to form the three discriminant scales.³ Again,
as with the intuitive and internal scales, the discriminant scales 
were necessarily non-overlapping. 
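The three strategies above can be illustrated with a minimal sketch. It is not the author's program: the function names and toy layout (one list of 0/1 scores per item) are assumptions made for illustration, items are assumed to be already dichotomized near the median, the item-total correlation is taken uncorrected (the item is included in its own total), and "correlations of .25 or less" is read literally as r <= .25 rather than as a limit on absolute value.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / (sxx * syy) ** 0.5


def scale_scores(items):
    """Intuitive strategy: unweighted sum of 0/1 item scores for each respondent.

    `items` is a list of item columns; each column holds one score per respondent.
    """
    n_respondents = len(items[0])
    return [sum(col[i] for col in items) for i in range(n_respondents)]


def internal_items(items, min_r=0.20):
    """Internal strategy: keep items correlating >= min_r with their own scale total."""
    total = scale_scores(items)
    return [j for j, col in enumerate(items) if pearson(col, total) >= min_r]


def discriminant_items(items, kept, other_scale_scores, max_r=0.25):
    """Discriminant strategy: from the kept items, retain those whose correlation
    with every other scale's scores is <= max_r (literal reading, an assumption)."""
    return [
        j for j in kept
        if all(pearson(items[j], scores) <= max_r for scores in other_scale_scores)
    ]
```

Because each later strategy only filters the item set produced by the earlier one, the resulting scales are nested, which is the property the Discussion relies on when contrasting this design with independently constructed scales.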

Subjects. Sample A combined a 5th grade class (N = 33) and a
6th grade class (N = 17) from one primary school (N = 50).
Sample B was a 5th grade class from another primary school
(N = 38). Sample C was a 6th grade class from the same school
(N = 36). The samples were combined for scale construction purposes
(N = 124).

Analyses. In each of the three samples, correlations were com- 
puted between the scales constructed by the three different strategies 
and an unweighted sum of the students’ grades in the following 
subjects: arithmetic, Dutch language, history, geography, gymnastics,
reading, and writing. The internal consistency of predictors
and criterion was estimated by Kuder-Richardson Formula 20 (KR-20).
Within each sample, grades were divided at the median; above
the median and below the median grades were transformed to ones 
and zeros, respectively. The resulting matrix of ones and zeros 
permitted a comparison between the internal consistency estima- 
tions of the grades and the predictors, and permitted the validity 
coefficients to be corrected for attenuation. Since the scales varied
in the number of items included in them, an estimate of the internal
consistency for the average item in each scale ($r_{ii}$) was also
computed, by reversing the Spearman-Brown prophecy formula:

$$ r_{ii} = \frac{r_{tt}}{n - (n - 1)\, r_{tt}} $$

where $n$ equals the number of items in the scale and $r_{tt}$ the
reliability of the scale estimated by formula KR-20.
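The reliability computations described above can be sketched as follows. This is an illustrative reading, not the original analysis program: the function names are hypothetical, KR-20 is computed with the population variance of total scores (an assumption about the convention used), and the attenuation correction is the standard formula dividing the validity coefficient by the square root of the product of the two reliabilities.

```python
def kr20(items):
    """Kuder-Richardson Formula 20 for a list of 0/1 item columns."""
    n_items = len(items)
    n_resp = len(items[0])
    totals = [sum(col[i] for col in items) for i in range(n_resp)]
    mean_total = sum(totals) / n_resp
    # Population variance of total scores (assumed convention).
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_resp
    # Sum of p*q over items, where p is the proportion scoring 1.
    sum_pq = sum((sum(col) / n_resp) * (1 - sum(col) / n_resp) for col in items)
    return (n_items / (n_items - 1)) * (1 - sum_pq / var_total)


def average_item_reliability(r_tt, n):
    """Reversed Spearman-Brown: reliability of the average single item."""
    return r_tt / (n - (n - 1) * r_tt)


def spearman_brown(r_ii, n):
    """Forward Spearman-Brown: reliability of a scale of n such items."""
    return n * r_ii / (1 + (n - 1) * r_ii)


def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Validity coefficient corrected for unreliability in predictor and criterion."""
    return r_xy / (r_xx * r_yy) ** 0.5
```

As a check on the reversal, applying `spearman_brown` to the output of `average_item_reliability` with the same `n` returns the original scale reliability, which is why the per-item estimate allows scales of different lengths to be compared.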

Results. The internal consistencies and validities of the scales
constructed by the different strategies are presented in Table 1. The
intuitive and the internal strategies can be compared only in the
case of the achievement scales, because of the lack of inconsistent
items on the other two scales. While the intuitive achievement
scale contained about 50 per cent consistent (high correlations with
total score) and 50 per cent inconsistent (no correlations with total
score) items, the debilitating anxiety intuitive scale had only consistent
items, and must therefore produce the same correlations
as the internal one. The intuitive scale for facilitating anxiety had
only one inconsistent item and was therefore practically the same
as the internal scale. Consequently, the intuitive scales for the
anxiety dimension were not compared with the internal scales. For
the SD dimension, only the internal scale was of interest because
this variable is not a relevant predictor, but rather a controlling
stylistic variable.

3 An exception was made for debilitating anxiety, where a limit of .30 was
chosen. This scale had so many nondiscriminating items that elimination of
these items would have reduced too severely the number of scale items.

The most prominent findings can be summarized briefly: (a) The 
internal scales were more internally consistent, as measured by 
formula KR-20 and the reversed Spearman-Brown formula $r_{ii}$,
than the intuitive ones, an inevitable result of the scale con- 
struction strategies. (b) The internal strategy produced a more 
valid achievement scale than the intuitive one. The statistical sig- 
nificance of the differences in validities was calculated in each of 
the three samples and the resulting probability levels were com- 
bined according to the procedure described by Jones and Fiske 
(1953). Although the difference between the correlations for the 
two strategies was not significant (composite probability was p < 
10), the internal strategy was consistently better in the three 
samples than the intuitive one. (c) The discriminant strategy pro- 
duced a slightly more valid achievement scale than did the internal 
strategy. This result appeared in two of the three samples. (d)
Although the discriminant achievement scales were the least internally
consistent of the three types, they were the most valid.
This implies that the non-discriminant items of the internal
scales added to the internal consistency of the scales, but decreased 
their validity. However, the differences in validity are so small 
that they may be due to chance. Comparing the internal and dis- 
criminant anxiety scales, the internal ones were more internally 
consistent. In terms of validity there were no consistent differences. 

Tables 2 and 3, which present the intercorrelations among the 
scales, show that the discriminant scales have higher discriminant
validity than the internal ones, as would be expected on the basis 
of the scale construction strategies. 

TABLE 2

Intercorrelations (Spearman) between the Scales in Sample A (Below the
Diagonal) and in Sample B (Above the Diagonal)

T 2 3 4 5 6 7 8

AM intuitive 86  .74 —.19 —.16  .23  .12  .06
AM internal .90 89 —.25  .19 .16  .06  .87
AM discriminant 88 95 —.01 —.01 .00 —.09  .36
DA internal —.294 —.45 —.42 . 3 —.50  .04 —.08
DA discriminant 05 —.12 —.08 .66 —.31  .07 —.18
FA internal .46 .54 -50 —.58 —.28 .68 .16
FA discriminant ee LogU 133  —.26 —.08 . .77 QUO
SD internal 28. .19  .12 —.00  .00  .13 11 i
SD discriminant E 0.19 —.04. .01 88

Discussion. The finding that the internal strategy produced a
more valid scale than the intuitive one seems to contrast with a
recent study by Hase and Goldberg (1967), which found no differences
in validity between internal and intuitive scales. In
fact, however, there is an important difference between the two 
studies. While Hase and Goldberg studied the validity of scales pro- 
duced by various strategies across a wide range of different 
criteria, the present study was concerned with the construction 
of one scale for one particular kind of criterion (achievement). 
Thus, while the internal achievement scale was more valid in the 
present study than the intuitive one, this is no guarantee that the 
former strategy would produce more valid general individual difference
variables.

Another difference is that the internal scale of the present study 
included only items which were also included on the intuitive 
scale. In fact, the internal scale forms the consistent part of the 
intuitive one. In the study done by Hase and Goldberg, the internal
and intuitive scales were constructed independently of each


2. AM internal .88
3. AM discriminant .79 .90
4. DA internal —.31 —.40 —.36
5. DA discriminant —.07 —.16 —.09 .62
6. FA internal .46 .43 .388 —.39 —.05
7. FA discriminant m 6 2 —.07 12 .76
8. SD internal * 1 á —.29 —.25 .35 .23
9. SD discriminant Al 3581-119 — 24 196 .22 .92
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other, so that the internal scales could have additional items 
other than those included in the intuitive scales. 

These distinctions between the two studies suggest that the in- 
ternal strategy may produce higher validities than the intuitive 
strategy only in the case where those items retained for the in- 
ternal scale were initially included in the corresponding intuitive 
scale. 
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SEMANTIC DIFFERENTIAL RATINGS AND 
THE RANK-ORDERING OF VALUES¹


ROBERT HOMANT 
Michigan State University 


The purpose of this study was to investigate the relationship between
the semantic differential (Osgood, Suci, and Tannenbaum,
1957) and the values which Rokeach has included in his ter- 
minal and instrumental values scales. Rokeach (1968) has defined 
a value as “an enduring belief that a specific mode of conduct or 
end state of existence is personally and socially preferable to alternative
modes of conduct or end states of existence.” A terminal
value is a belief about an end state of existence (e.g., a
comfortable life, an exciting life, a sense of accomplishment), and
an instrumental value is a belief about a mode of conduct (e.g.,
ambitious, broadminded, capable). In order to measure the relative
importance of values to individuals, Rokeach selected 18 terminal
and 18 instrumental values for use in two separate rank-order
preference scales. Subjects were instructed to rank the values
in order of their importance in their lives.

The ranking technique has the advantage of forcing the individual 
to generate a value system. This is important if one hypothesizes 
that behavior is determined by the relative (rather than the abso- 
lute) importance of a person’s values. However, questions may be 
raised about the reliability and validity of such a technique.
Penner, Homant, and Rokeach (1968) have found that the reli- 
ability of Rokeach's value scales is comparable to that obtained 
by a more laborious paired-comparison method of measuring those 
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values. The present study is concerned with the construct validity 
of the values as measured by Rokeach’s ranking method. 

Hypothesis one: if Rokeach's value scales measure personal pref- 
erence for values, then subjects’ ranking of the values should be 
correlated with the evaluative dimension of the connotative mean- 
ing of those values. 

Hypothesis two: subjects’ ranking of the values should not be 
correlated with the potency and activity dimensions of connotative 
meaning.

It should be noted that with a random selection of semantic 
differential scales, potency and activity factors do not always 
emerge in factor analysis because of concept-scale interaction. 
Therefore, in order to achieve wider application for this study, it 
was thought best to select only those scales which typically mea- 
sure the three traditional aspects of connotative meaning. 

Osgood, Suci, and Tannenbaum (1957) have shown that the 
evaluative dimension of the semantic differential functions as an 
attitude measuring technique. In this sense, measuring Rokeach's 
values on evaluative semantic differential scales could be considered 
a more time-consuming (and indirect) method for measuring a
person's values. Even theoretically, however, we do not expect a 
perfect correlation between value rank and semantic differential 
rating because it is possible that on a “purely cognitive” level a
person would feel that a particular value (e.g., equality) is quite 
important, even though on a more emotional level—presumably re- 
flected in the semantic differential—he had some negative feelings 
about the value. For the most part, however, we assume that there 
should be a close relationship between a person’s cognitive evaluation
of a value and his semantic differential rating of it.

Method. Fifteen semantic differential scales were selected from
previous work by Osgood et al. (1957) and by Osgood, Ware, and 
Morris (1961). Five scales were selected to reflect each of the three 
traditional aspects of connotative meaning. For evaluation: beauti- 
ful-ugly, kind-cruel, good-bad, positive-negative, and success- 
ful-unsuccessful. For potency: heavy-light, hard-soft, large-small, 
strong-weak, and masculine-feminine. For activity: excitable-calm, 
fast-slow, hot-cold, active-passive, and vibrant-still. 
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one value at a time, with scales counterbalanced as to direction and 
presented in random order. The instructions for the semantic dif- 
ferential were adapted from Osgood et al. (1957). Forty-four other 
subjects followed the exact same procedure for the instrumental 
values.

Results. Semantic differential ratings were first factor analyzed 
to determine whether potency, activity, and evaluative factors had 
been obtained. Mean correlations were computed between each of 
the sets of five scales and the factors of a three-factor varimax 
solution. As can be seen in Table 1, each set of scales correlated 
highly with one of the factors. (Each factor is named according to 
the set of scales that correlated most highly with it.) However, the 
potency and activity scales also showed low positive correlations 
with the evaluative factor. 

Since this could confound our test of the second hypothesis, a 
subset of two scales per factor was selected to maximize indepen- 
dence with the two factors which the scales were not intended to 
measure. These scales were good-bad and kind-cruel for evalua- 
tion, hard-soft and heavy-light for potency, and excitable-calm 
and vibrant-still for activity. The comparable data for these 
subsets are also given in Table 1. 

For each subject, correlations were obtained between his value 
rankings and the sum of his semantic differential ratings for each 
set of scales. These results are given in Table 2. Looking at the results
for the sets of five scales we see that there is a significant
correlation between the ranks of the values and each of the factors.
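The per-subject analysis described above (a rank correlation for each subject, then a median across subjects) can be sketched as follows. The function names and data layout are assumptions for illustration. Ranks are assumed to run from 1 (most important) upward and are reversed before correlating, so that a positive rho means more important values received higher summed ratings; tied rating sums receive average ranks.

```python
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / (sxx * syy) ** 0.5


def average_ranks(values):
    """Rank values from 1 upward, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def subject_rho(value_ranks, rating_sums):
    """Spearman rho between one subject's value importance and summed ratings."""
    n = len(value_ranks)
    importance = [n + 1 - r for r in value_ranks]  # rank 1 becomes the highest score
    return pearson(average_ranks(importance), average_ranks(rating_sums))


def median_rho(subjects):
    """Median of per-subject rhos; `subjects` is a list of (ranks, rating_sums)."""
    return median(subject_rho(ranks, sums) for ranks, sums in subjects)
```

Reporting the median rather than a pooled correlation keeps each subject's own value system as the unit of analysis, which matches the way Table 2 is described.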


TABLE 1 
Mean Correlations between Three Sets of Semantic Differential Scales 
and Three Factor Solution 
Factor E 
Scales Evaluation Potency Activity 
(Original sets of five scales.) 

Evaluation -74 -.13 t 4 
Potency .28 .60 760 
(Sets of two scales selected to maximize 
independence with other factors) a 18 
Evaluation .74 EX 7A 105 
Potency .00 e 73 
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TABLE 2 


Median Correlations between Terminal and Instrumental Value Scales 
and Three Semantic Differential Factors 


                                  Factors
                        Evaluation  Potency  Activity

(Sums of semantic differential ratings for values
are based on original sets of five scales)
Terminal values            .68        .36      .45
Instrumental values        .62        .46      .82

(Sums of semantic differential ratings for values
are based on subsets of two scales selected to
maximize independence with other factors)
Terminal values            3          .16      .24
Instrumental values        .47        .22      .10


and .26, respectively. 

The correlations with the potency and activity factors, however, 
are greatly reduced when the purer subsets of two scales are used, 
although a significant correlation between the terminal values and 
the activity factor remains. 

Discussion. The first hypothesis is clearly supported, thus lend- 
ing some construct validity to Rokeach’s value scale. With respect 
to the second hypothesis, potency and activity seem to be only 
minor determinants of the ranking process for most subjects. The
median correlations which were obtained for potency and activity 
(see Table 2) can be attributed to chance. However, these correlations
do suggest another possibility which we can mention here only
briefly. 

First of all (as suggested in a similar context by Osgood et al.,
1961) there may be a preference for strong and active ways of 
life in our culture. Thus with a different culture—or perhaps even 
with a different sample of values—these correlations might vanish 
or even reverse themselves. 

In concluding, it should be noted that some striking individual
differences were necessarily overlooked in reporting the results. For 
some subjects the correlation between their values and the potency 
or activity scales was as high as .88 (even with the two-scale 
subsets) and correlations with evaluation dipped as low as —.81. 
Such discrepancies, though relatively rare, could bear investigating. 

Summary. Fifteen semantic differential scales were used to measure
the connotative meaning of Rokeach’s values. Eighty-eight
subjects filled out either Rokeach's terminal or instrumental value
scale, and then rated those values on semantic differential scales.
Median rho correlations between individual subjects’ value rankings
and semantic differential ratings on evaluative scales were
.68 and .62 respectively. The scales measuring potency and activity
were only slightly related to value preference.
were only slightly related to value preference. 
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CONCURRENT VALIDITY OF A BRIEF 
TEST OF ACADEMIC APTITUDE 


LARRY C. KERPELMAN 
University of Massachusetts 




The Control Test AA (CTAA) is an instrument originally designed
to provide researchers with a measure of aptitude or ability
in order to control for that variable in making comparisons on 
other characteristics among groups of college students (Peterson, 
1965). The test contains nine antonym items, 12 quantitative com- 
parison items, and nine verbal analogy items. It is group-ad- 
ministered and takes little time (12 minute time limit). In an 
initial study of the predictive validity of the CTAA, Peterson 
(1968) reported Pearson product-moment rs between CTAA scores 
and student-reported freshman grade point averages at three different
independent colleges of .40 (N = 225), .51 (N = 332), and
.39 (N = 106). The present paper reports a series of studies correlating
the CTAA with grades as well as with two other
measures of intellectual ability.

Experiment I. As part of an undergraduate psychology honors
project on a topic unrelated to the CTAA or intelligence, 108 students,
predominantly freshmen, from the Introductory Psychology
subject pool at the University of Massachusetts were administered,
in groups of 20 to 30, the CTAA.¹ The students participated in the
experiment for required course credit. As part of the procedure, but
not contiguous with the CTAA administration, Ss were requested
to provide their cumulative grade point average (GPA). Mean
CTAA score was 21.92, SD = 3.23; mean GPA was 2.41, SD


1 The author wishes to thank the experimenter, Eric Wish, for including
the CTAA in his test battery.
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= 0.63. The Pearson r between students’ reported GPAs and their 
CTAA scores was .33 (p < .001). 

Because the Ss in this experiment were predominantly freshmen, 
it could be argued that their GPAs did not have the opportunity to 
reach stability, i.e., that their GPA was more a reflection of fresh- 
man “upset” and less a reflection of ability. Furthermore, GPA 
may not be so valid an index of ability or aptitude as may be a 
more direct measure of intellectual capacity. Consequently, in the 
next experiment, a standard intelligence test and the CTAA were 
administered to college students and their scores correlated. 

Experiment II. The experimenters (Es) in this experiment were
13 first year graduate students in clinical psychology at the University
of Massachusetts. As a requirement of the author's graduate
course in methods of assessment, they gave three practice
administrations of the WAIS (Wechsler, 1958) to college students. 
In their second and third administrations, Es also gave the CTAA. 
Subjects were 23 students, predominantly freshmen, from the Introductory
Psychology course at the University of Massachusetts
who participated in the experiment on a voluntary basis for
optional course credit. All tests were administered individually. In
half the administrations, the CTAA was given first; in half, the
WAIS was given first. The CTAAs were scored by a research
assistant; the WAISs were scored by the clinical graduate students,
and their scoring was checked and corrected, if necessary, by the
course teaching assistant.

The mean score when the CTAA was administered first was 21.09
(SD = 3.48, N = 11); when administered second, it was 21.17
(SD = 2.95, N = 12). The WAIS mean full scale IQs were 118.73
(SD = 5.69, N = 11) when the WAIS was given second and 115.08
(SD = 5.42, N = 12) when given first. Since there appeared to be no
order effect, the data for all 23 Ss were combined, with a resulting
CTAA-WAIS Pearson r of .34 (.10 < p < .30).
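The significance levels quoted for these correlations can be checked with the usual t transformation of a Pearson r, t = r * sqrt(N - 2) / sqrt(1 - r^2) with N - 2 degrees of freedom. The sketch below assumes this standard test was the one used, which the source does not state; the function name is hypothetical.

```python
import math


def t_for_r(r, n):
    """t statistic (df = n - 2) for testing a Pearson r against zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


# Experiment II: r = .34 with N = 23 gives t of about 1.66 on 21 df.
t_exp2 = t_for_r(0.34, 23)

# Experiment I: r = .33 with N = 108 gives t of about 3.60 on 106 df.
t_exp1 = t_for_r(0.33, 108)
```

For 21 degrees of freedom the two-tailed .10 critical value of t is 1.721, so a t of about 1.66 falls just short of it, consistent with the reported p > .10. The comparison shows how strongly the conclusion depends on sample size: nearly the same r is clearly significant with 108 subjects and not with 23.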


A possible criticism of this experiment is that the WAIS IQ
scores, even though checked for accuracy by an advanced graduate
student, may not be valid indices of intellectual ability because of
the inexperience of the Es in administering individual intelligence
tests. Consequently, in the next experiment a more experienced E

Experiment III. As part of a large project on student activism,²
the author included in a battery of group-administered questionnaires
the CTAA and the Borgatta and Corsini (1964) Quick Word
Test (QWT), form Am, level 2. The latter is a rapidly administered
100 item test of verbal ability that has no time limit, but generally
takes about 15 minutes to complete. The QWT correlates between
.62 and .69 with other group-administered tests of mental ability,
and between .58 and .78 with various subtests of the ACE Linguistic
Test (Borgatta and Corsini, 1964).

The Ss in this experiment were college students at three institu- 
tions of higher learning located in the northeastern US. Institu- 
tion A was a small, coeducational, independent liberal arts col- 
lege; Institution B was a medium-sized, coeducational, independent 
university; and Institution C was a large, coeducational public uni- 
versity. Since some Ss belonged to political activist organizations 
and some did not, since the Ss were not Introductory Psychology 
volunteers but rather were paid volunteers, and since, consequently, 
they were from all year levels at their institutions, these samples 
were thought to be somewhat more representative of the total 
general college population than were the samples in Experiments 
I and II.

The Ss were met with in groups of two to 35, with the author 
serving as E in all instances. The students received a booklet
containing questionnaires, in which the first instrument was the 
CTAA and the sixth one was the QWT. After completing the timed 
CTAA, Ss worked at their own rate of speed in completing the 
other research instruments, including the QWT. 

The results of this experiment are presented in Table 1. As can 
be seen, the correlations ranged from .25 to .57.

Discussion. With more careful and appropriate measures administered
from Experiment I through Experiment III, the correlations
of the CTAA with a concurrent measure of aptitude increased.
The one exception was at Institution A in Experiment III,
where the correlation was a small .25. It is evident from the QWT
means that students at Institution A are very intelligent. It is 


elfare, The opinions exp! 
of the U. S. Office of Education, and no 
should be inferred, 
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TABLE 1 
Means, SDs, and rs of Control Test AA and Quick Word Test Scores 


                              CTAA            QWT
Measure                    Mean    SD     Mean     SD      r

Institution A (N = 85)    27.05   1.88   75.93   11.86    .25
Institution B (N = 92)    25.23   2.55   65.70   oS       po
Institution C (N = 123)   22.83   3.37   58.46   14.3     .

 *p < .02.
**p < .001.


further suggested from the CTAA mean at Institution A that there
was a “ceiling” effect which very likely attenuated the correlation
at that school. Although Peterson (1965) reported an absence of
any perfect scores leading to a “ceiling” effect in his original validating
samples, and only two out of 25 students at the “best”
institution in his sample achieving 29 out of 30 correct on the
CTAA, the corresponding numbers at Institution A in the present
study were six and 16 (out of 85 students). It is apparent from
examination of the means for Peterson’s samples that none of them
approached Institution A in the present study in measured mental
ability.

These data suggest that caution should be taken in using the
CTAA at schools where the students are highly selected and of
superior ability. An alternative suggestion would be to lower the
time limit of the CTAA in an attempt to raise the “ceiling.” With
this one exception, the results of the present study of the concurrent
validity of the CTAA [...] with correlations generally of
the same magnitude. Peterson's cautionary note that the test is
[...] research purposes but is not appropriate for assessing
individual students is reinforced by the results of the present
study.
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PREDICTIVE VALIDITY OF THE 
MATHEMATICS PLACEMENT EXAMINATION 


LINDA R. SHEVEL AND DOUGLAS R. WHITNEY 
The American College Testing Program 


In the fall of 1968, The American College Testing Program (ACT)
announced the availability of a new test, the Mathematics Placement
Examination (Exam), for use in college placement. The Exam
consists of two 50-minute parts and provides scores in Intermediate
Algebra, College Algebra, Trigonometry, and Special Topics.
Development, reliability, validity, and suggested uses of the Exam
are described in detail in the Manual for the ACT Mathematics
Placement Examination (Shevel and Whitney, 1968).

Purpose. Student ACT scores and high school grades are normally
available to colleges and universities prior to student registration
and can be used in placement. The primary purpose of this
study was to determine whether the Mathematics Placement Examination
offered sufficient improvement in the prediction of college
mathematics grades to warrant its addition to a college’s testing
program. A secondary purpose was to compare the differential predictive
validity of the Mathematics Placement Examination with
that of ACT scores and high school grades for classes that differed
in average mathematics ability and for classes that covered different
types of material.
Predictive validity of the Math Placement Examination and the
TH Index. Colleges participating in ACT's Standard Research Service
(Plan A) receive predictive data for overall grades and grades
[...] which they choose (The American College
Testing Program, 1969). Their Plan A reports, then, include a
multiple correlation between predictors (ACT scores
and high school grades) and college mathematics [...] correlation,
called the TH Index because it utilizes both ACT test
scores and high school grades, provides grade predictions in college 
mathematics. The TH Index might be considered the standard 
ACT prediction. Since ACT-participating colleges already have 
such information readily available, it is important to determine 
how much the Mathematics Placement Examination adds to the 
TH Index. 

Colleges included in the sample and their course offerings in 
mathematics are described in the Manual for the ACT Mathe- 
matics Placement Examination (Shevel and Whitney, 1968). For 
each mathematics course for which complete ACT data were 
available, we computed the multiple correlation between first semes- 
ter mathematics grades and six variables (ACT Mathematics, 
ACT Composite, and the four ACT high school grades). We 
also computed the correlation between first semester college mathe- 
matics grades and ten variables (the above six plus four scores of 
the Mathematics Placement Examination). 

On the basis of course title, textbook used, and description of 
the mathematics course provided by the college, we grouped these 
courses into six content categories. For each class in each category, 
the standard ACT prediction equivalent (the six-variable) and the 
standard ACT prediction equivalent plus Math Placement scores 
(the ten-variable) multiple correlation coefficients were computed 
and the differences between the two were noted. Table 1 presents 
the results of these analyses. These results indicate that the two 
correlations are similar for introductory level courses, but that the 
augmented multiple correlation is markedly larger for upper level 
courses.

We conclude that the Math Placement Examination adds more
to the TH Index prediction for upper level courses. The increase in
prediction is substantial for calculus and honors courses.


[...] that the six-variable R could be considered [...]

As a regular part of the ACT assessment, students report their most recent
grades in English, mathematics, social studies, and natural sciences.

TABLE 1

Comparison of Standard ACT Prediction with Standard ACT Prediction
Plus Mathematics Placement Tests

                                                        Standard ACT
                                        Standard ACT    Prediction Plus    Median
                                        Prediction      Math Placement     Increase
Class Categories                        Median          Median             in R
General Math—Business Math (3 classes)    .561            .585              .051
Intermediate Algebra (4 classes)          .564            .630              .074
College Algebra (6 classes)               .526            .639              .086
Trigonometry—Analysis (6 classes)         .632            .714              .105
Calculus (4 classes)                      .532            .742              .206
Honors (1 class)                          .375            .771              .396

Note.—[...] difference between the two median correlations reported.
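
The last column of Table 1 is a median of class-level increases, which need not equal the difference between the two column medians. The sketch below illustrates the point with hypothetical class-level correlations (they are not data from the study) chosen so that the three summaries reproduce the first row of Table 1.

```python
from statistics import median

# Hypothetical multiple correlations for three classes (NOT study data).
six_variable_r = [0.500, 0.561, 0.700]  # standard ACT prediction
ten_variable_r = [0.585, 0.580, 0.751]  # plus four Math Placement scores

# Increase in R computed class by class, then summarized by its median.
increases = [ten - six for six, ten in zip(six_variable_r, ten_variable_r)]

median_six = median(six_variable_r)
median_ten = median(ten_variable_r)
median_increase = median(increases)
difference_of_medians = median_ten - median_six

print(round(median_six, 3), round(median_ten, 3))
print(round(median_increase, 3), round(difference_of_medians, 3))
```

With these made-up values the median increase is larger than the difference of the medians, because the class that supplies the median six-variable R is not the class that supplies the median ten-variable R.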


Predictive validity for differing course content. In order to
determine whether the Mathematics Placement subtest scores and
other frequently used predictors possessed differential predictive
validity with respect to the content of mathematics courses, we
computed means, standard deviations, and median correlations with
college mathematics grades for each of thirteen variables related to
college grades. These computations were made for each course for
which we had complete data. The thirteen variables included six
scores of the ACT Mathematics Placement Exam, the mathematics score
and composite score of the regular ACT battery, self-reported high
school grades in English, mathematics, social studies, natural
science, and a high school grade point average computed from these
four grades. These results are presented in Table 2.

The Mathematics Placement Exam appears to have greater predictive
validity for higher-level courses than for remedial courses. Thus,
the Mathematics Placement Examination seems to be most useful,
relative to other predictors, for the suggested use in placement
into higher-level courses.

Predictive validity for differing ability levels. To examine the
possibility of differential predictive validity for classes with
different ability levels, we divided the classes into three groups:
those classes with mean ACT mathematics scores less than 20.0; those
with mean ACT mathematics score 20.0-24.99; those with mean
ACT mathematics score greater than 24.99 (approximately one
standard deviation above the mean). Table 3 presents the means,
standard deviations, and median correlations between college
mathematics grades and each of the thirteen variables previously
described.

Table 3 shows that the Mathematics Placement Examination is 
no more effective for predicting first semester college mathematics 
grades for lower ability classes than are ACT scores or high school 
grades. For the highest ability classes, however, the median corre- 
lations for the Intermediate Algebra and Total Scores are higher 
than those of the other predictors. It would appear, then, that the


TABLE 3

Means, SDs and Median Correlations of Thirteen Variables with First Semester College
Mathematics Grades for Three Ability Groups

                 Low group             Middle group          High group
                 ACT Math              ACT Math              ACT Math
                 less than 20.0        20.0-24.99            25.0+
                 (9 classes)           (6 classes)           (8 classes)
Variable         Mean  SD  med. r      Mean  SD  med. r      Mean  SD  med. r
Og, grade Les 1:19: 0:00] austere. 25 DU 2.00 1.30 1.00 
I-A 
scores 
45 
Int. algebra er ate E LIU Aa B2 15.29 4831 - 
4^ Clg. algebra 3:35 2.04 —.05 404 2.2  .19 bis an a 
Total algebra 11.02 4.32 30 $14 59 89 20-88) OQ ea 
Trigonometry "(avi MS E om 5 58 PES as 
Total score 13.75 5.14) 8175 /19:96 7.50 40 9540 a sa Pi a8 
| Spec. topics Seb 128, 0500 ara A288 ‘32 0340 L 
} ACT scores ^ 
{| ACT Math isai 440. 88, 232 EE “ 2x n in 
y ACT Comp. 19.40 3.08 35 22.29 3.55 -3 . 
| High school grad i 
| HS Engish 2.79 084 2 sos S 5 M 
ao 2.30 0.98 P 81 091 -3% gos 0.78 28 
HS soc. studies — 2.03 0.81. 590 06 0.8 3 i ose 30 
HS nat. science 2.59 0.89 — 29 73 0.89 > 29. a5 
HS gpa 2.67 0.00 -35 88 0.67 - :96 0. 
Note.—The correlations between ACT Mathematics and college grades may be
severely understated since ACT Mathematics was used as the criterion to
define these ability levels.

Mathematics Placement Examination does work better than other
predictors for high ability students.² Even for the high ability
group, however, the highest correlation (either the Intermediate
Algebra or Total Score) shows that the Mathematics Placement
Examination when used alone accounts for only 20 per cent of the
remaining variance in college grades (.45²).
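
The 20 per cent figure follows from squaring the median correlation of .45. A minimal sketch of that arithmetic (the function name is illustrative only):

```python
def variance_explained(r: float) -> float:
    """Proportion of criterion variance accounted for by a single predictor."""
    return r ** 2

share = variance_explained(0.45)
print(f"{share:.4f}")  # proportion of variance in college grades
```

A correlation of .45 therefore leaves roughly four-fifths of the variance in grades unaccounted for by the Exam alone.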

In this table, the median r for the Intermediate Algebra Score of 
the Mathematics Placement Exam is, for all three groups, at least 
as high as that for the Total Score. This implies that the Inter- 
mediate Algebra Score is usually a more useful predictor than is 
the Total Score and that to improve the prediction of college 
grades by the use of the other subtests in addition to Intermediate 
Algebra, a more nearly optimal weighting of the subtest scores 
would be necessary. 

Summary. The results of these analyses imply that the addition
of the Mathematics Placement Examination improves college 
mathematics grade prediction based on the TH Index, and that 
the increase is greater for higher-level courses. Whether the in- 
crease in predictive efficiency using the Mathematics Placement 


² It should be understood, however, that the correlations between ACT
Mathematics and college grades may be severely understated since ACT
Mathematics was used as the criterion to define these ability levels.
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PREDICTING ALGEBRA ACHIEVEMENT WITH AN 
ALGEBRA PROGNOSIS TEST, IQS, TEACHER 
PREDICTIONS, AND MATHEMATICS GRADES 


GERALD S. HANNA
Kansas State University 


HAROLD F. BLIGH AND JOANNE M. LENKE 
Harcourt, Brace & World, Inc. 


JOSEPH B. ORLEANS 
Ramaz School, New York City 


School grades, IQs, achievement tests, special-purpose prognosis
tests, and recommendations by eighth-grade mathematics teachers
have all been used to predict algebra success. Although studies of
the validity of teacher recommendations have not appeared in the
literature recently, multiple-regression analyses involving various
combinations of the other kinds of predictors have been reported
(Dinkel, 1959; Duncan, 1960; Barns and Asher, 1962; Sabers and
Feldt, 1968).

The purposes of this study were to compare the validities of a
special-purpose prognosis test, IQs, teacher-predicted algebra grades,
and mathematics grades in predicting algebra success and to estimate
the extent to which the validities of the individual predictors
could be raised by using multiple-regression procedures.

Methods. The initial sample consisted of 1,105 eighth-grade
mathematics students in nine schools in six states who took the
Orleans-Hanna Algebra Prognosis Test in April and May of 1967. This
test contains 58 work-sample test items, student-predicted mid-year
algebra grades, and student-reported most recent report-card grades
in mathematics, science, English, and social studies. The grades are
scored A = 8, B = 6, C = 4, D = 2, E/F = 0. Thus the maximum
total score is 98.
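
The maximum of 98 can be reproduced from the parts listed above. The sketch below assumes one point per work-sample item, which the text does not state but which the stated maximum implies (58 + 5 × 8 = 98). The function and constant names are illustrative.

```python
# Grade scoring as given in the text: A = 8, B = 6, C = 4, D = 2, E/F = 0.
GRADE_POINTS = {"A": 8, "B": 6, "C": 4, "D": 2, "E": 0, "F": 0}
N_WORK_SAMPLE_ITEMS = 58

def prognosis_total(items_correct, predicted_grade, reported_grades):
    """Total score: work-sample items (assumed 1 point each) plus five grades.

    reported_grades holds the four self-reported grades
    (mathematics, science, English, social studies).
    """
    if not 0 <= items_correct <= N_WORK_SAMPLE_ITEMS:
        raise ValueError("items_correct must be between 0 and 58")
    if len(reported_grades) != 4:
        raise ValueError("exactly four reported grades are expected")
    grade_part = GRADE_POINTS[predicted_grade]
    grade_part += sum(GRADE_POINTS[g] for g in reported_grades)
    return items_correct + grade_part

max_score = prognosis_total(58, "A", ["A", "A", "A", "A"])
print(max_score)
```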

At the time of testing, schools reported each examinee's most 
recent report-card mathematics grade. In addition, eighth-grade 
mathematics teachers predicted the mid-year grade each student 
would receive in algebra if he were to take the subject. Devia- 
tion IQs (DIQs) obtained during the October-November, 1966, 
standardization of the Otis-Lennon Mental Ability Test were 
secured. Teachers were given no instruction as to the use or non-use 
of DIQs in predicting algebra grades. 

During January and February of 1968, algebra teachers re- 
ported students’ mid-year algebra grades. In May and June, the 
Lankton First-Year Algebra Test, Revised Edition was administered 
and algebra teachers reported year-end grades. Prognosis and 
achievement test scores were not released to schools until all 
algebra grades had been assigned. The final sample consisted of 
those students for whom all predictor and criterion measures were 
available. 

Zero-order correlations among the variables were computed, and
three sets of stepwise regression analyses were conducted. The first
multiple-regression analysis for each criterion used the Algebra
Prognosis Test total scores, DIQs, and teacher predictions as
independent variables;¹ the second analysis for each criterion
included scores on the work-sample section of the prognosis test,
DIQs, teacher predictions, and school-reported eighth-grade
mathematics grades; the third set of analyses used DIQs, teacher
predictions, and mathematics grades.

Results and discussion. Table 1 reports means, standard deviations,
and intercorrelations of variables. As would be expected, the total
prognosis test scores and the scores on its work-sample section were
highly correlated. Although the work-sample section was slightly
more valid than the total test in predicting achievement test
scores, the total test scores correlated higher with both grade
criteria.

Both the total prognosis test and its work-sample section predicted
all criteria more accurately than did DIQs, teacher predictions, or
mathematics grades. Since the mental ability test was administered
about six months before the prognosis test, direct comparison of
their predictive validities may be slightly biased in favor of the
prognosis test. The DIQs surpassed teacher predictions and
mathematics grades in predicting achievement test scores, while the
three variables were about equally valid for predicting the grade
criteria. Both the total prognosis test scores and scores on its
work-sample section correlated slightly higher with algebra grades
than did the algebra achievement test scores.

TABLE 1
Zero-Order Correlations (N = 310)

Variable                        M       SD      2    3    4    5    6    7    8
1. Prog. Test Total Score     62.92   18.07   .96  .68  .80  .62  .68  .66  .78
2. Work-Sample Items Only     38.17   13.46        .57  .82  .50  .63  .62  .80
3. 8th Grade Math. Grade*      2.39    0.94             .49  .72  .54  .50  .49
4. Otis-Lennon DIQs          104.10   16.31                  .40  .56  .55  .70
5. Teacher-Pred. Grade*        2.25    1.00                       .57  .53  .46
6. Mid-Year Algebra Grade*     2.27    0.99                            .81  .60
7. Year-End Algebra Grade*     2.16    1.01                                 .60
8. Year-End Algebra Test      24.75   10.12

* A = 4, B = 3, C = 2, D = 1, E/F = 0.

¹ School-reported mathematics grades were not included in the first set of
analyses because of their redundancy with the student-reported mathematics
grades that are included in the total Algebra Prognosis Test score (Hanna and
Bligh, 1968).

Results of the three sets of multiple-regression analyses are
presented in Table 2. It is seen in the top section that teachers'
predictions and mental ability test scores added little to the
prognosis test's validity in predicting the criteria. Since the
Algebra Prognosis Test is a composite of several variables, the
second set of multiple-regression analyses used only the work-sample
section of the prognosis test. Teacher-predicted algebra grades
added significantly to the prediction of each grade criterion. DIQs
and eighth-grade mathematics grades added very little to the
prediction of any criterion. The bottom set of analyses used DIQs,
teacher predictions, and mathematics grades. DIQs contributed [...]
to the prediction of the year-end criteria. Teacher-predicted
grades contributed substantially in predicting each grade criterion.
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TABLE 2 
Stepwise Regression Analyses (N = 810)

Mid-Year Grade                Year-End Grade                Year-End Algebra Test
Predictors in      R, Each    Predictors in      R, Each    Predictors in      R, Each
Order of Entry     Step       Order of Entry     Step       Order of Entry     Step

Alg. Prog. Test*    .681      Alg. Prog. Test*    .662      Alg. Prog. Test*    .781
Teacher Pred.*      .706      Teacher Pred.*      .679      DIQ*                .792
DIQ                 .708      DIQ                 .682      Teacher Pred.       .792

Work-Sample Test*   .630      Work-Sample Test*   .617      Work-Sample Test*   .799
Teacher Pred.*      .694      Teacher Pred.*      .667      DIQ*                .804
DIQ                 .698      DIQ*                .673      Teacher Pred.*      .806
Math Grade          .699      Math Grade          .673      Math Grade          .806

Teacher Pred.*      .566      DIQ*                .553      DIQ*                .703
DIQ*                .670      Teacher Pred.*      .646      Teacher Pred.*      .728
Math Grade          .674      Math Grade          .648      Math Grade          .731

* Beta weight significant (p < .05).


Mathematics grades again added virtually nothing to the predic- 
tion of any criterion. 
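
The small gain from adding teacher predictions to the prognosis test can be checked against the zero-order correlations in Table 1 with the standard formula for the multiple correlation of a criterion with two predictors. The sketch below is illustrative only. Its inputs are the two-decimal correlations from Table 1, so agreement with Table 2 holds only to rounding.

```python
import math

def multiple_r_two_predictors(r_y1: float, r_y2: float, r_12: float) -> float:
    """Multiple R of a criterion y with predictors 1 and 2.

    r_y1, r_y2: zero-order validities; r_12: correlation between predictors.
    """
    r_squared = (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)
    return math.sqrt(r_squared)

# Prognosis test total (1) and teacher-predicted grade (5); r_15 = .62.
mid_year = multiple_r_two_predictors(0.68, 0.57, 0.62)  # criterion 6
year_end = multiple_r_two_predictors(0.66, 0.53, 0.62)  # criterion 7
print(round(mid_year, 3), round(year_end, 3))
```

Both results fall within about .001 of the second-step values in the top section of Table 2 (.706 and .679). The teacher prediction correlates .62 with the prognosis test, so it adds little beyond the test's own validity.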

Contrary to the findings of this study, other investigators (Dinkel, 
1959; Duncan, 1960; Barns and Asher, 1962; Sabers and Feldt, 
1968), using prognosis tests and other predictors, have reported 
multiple correlations that distinctly exceeded the zero-order pre- 
dictive validities of prognosis tests when used alone. Comparison of 
the top two sections of Table 2 suggests that the inclusion of 
student-reported past grades and student-predicted algebra grades 
in the Algebra Prognosis Test may account for the small validity 
increments realized when teacher predictions and DIQs were used 
with the prognosis test scores. Teacher predictions and DIQs in- 
creased the zero-order validities of the total prognosis test much 


less in the first set of analyses than they improved the zero-order 
validities of the 


ses. These findi 
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THE CONTRIBUTION OF WORK-SAMPLE 
TEST ITEMS, PAST GRADES, AND STUDENT- 
PREDICTED GRADES TO THE PREDICTION OF 
GEOMETRY ENROLLMENT 


GERALD S. HANNA AND JOHN T. ROSCOE 
Kansas State University 


A considerable amount of research has been devoted to predict- 
ing success in high school geometry. Studies have been concerned 
with who should take geometry and to what extent students are 
likely to benefit from the course as commonly taught (Hanna, 
1966). Previous workers have not used predictive variables pri- 
marily to identify which students will take geometry. In other 
words, attention has been on predicting success in geometry rather
than enrollment in the subject.

In this tradition, the Orleans-Hanna Geometry Prognosis Test 
was developed to predict degree of geometry success. The test em- 
ploys the practice of having students report their past grades in 
four school subjects. In addition, each examinee predicts the grade 


he would receive in geometry if he were to take the subject. The 


rationale and empirical procedures leading to the weighting of 


these five variables and the 40 work-sample test items [...]
presented elsewhere (Orleans and Hanna, 1968; Bligh, Lenke, [...]).

[...] prognosis tests is predicting [...] This notion was suggested
[...] in algebra might well differ [...] geometry from those who were
[...] terminal mathematics students differ in many respects from
those of continuing students.

Purpose. The purposes of this study were (a) to identify the
contribution of the six parts of the Geometry Prognosis Test in
discriminating students who subsequently took geometry from those
who did not, (b) to compare the validities of the discriminant
function equation and the standardized weighting of the test parts
in predicting geometry enrollment, and (c) to provide enrollment
expectancy data for the Geometry Prognosis Test.

Procedure. In April and May of 1967, 745 first-year algebra stu- 
dents from nine four- and six-year high schools in seven states 
took the Geometry Prognosis Test. Results were not released to 
schools until the end of the following school year. The principal 
criterion was enrollment in geometry the following school year. 
More conventional achievement criteria included the Mid-Year 
Geometry Test and mid-year grades. Students in geometry courses 
at the end of the first half of the 1967-68 school year were desig- 
nated the geometry group in this study. Students not taking geome- 
try that year in these same schools were designated the nongeome- 
try group. The latter group obviously included some students who 
moved and took geometry in their new school and others who en- 
rolled in geometry at a later time. These cases of misclassification 
probably attenuated the validity of the test and its parts. 

Findings. Table 1 presents the means and standard deviations of
the six predictors for students in each group. All six differences in
means are significant (p < .001). The point biserial correlations
between each test part and the enrollment criterion are also given
in Table 1.
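
A point biserial correlation can be recovered from group means, group sizes, and the total-sample standard deviation. The sketch below applies that identity to the Predicted Grade row, taking the group means from Table 1 and the total-sample SD from Table 2. Treating the reported SD as a population SD is an assumption. With N = 745 the distinction is negligible.

```python
import math

def point_biserial_from_summary(mean_1, mean_0, n_1, n_0, sd_total):
    """Point biserial r between a 0/1 criterion and a continuous predictor.

    mean_1, n_1: mean and size of the group coded 1 (enrolled);
    mean_0, n_0: mean and size of the group coded 0 (not enrolled);
    sd_total: standard deviation of the predictor in the combined sample.
    """
    n = n_1 + n_0
    p = n_1 / n
    q = n_0 / n
    return (mean_1 - mean_0) / sd_total * math.sqrt(p * q)

# Predicted Grade: enrolled M = 2.41, not enrolled M = 1.61, total SD = 0.96.
r_pb = point_biserial_from_summary(2.41, 1.61, 478, 267, 0.96)
print(round(r_pb, 2))
```

The result rounds to .40, which agrees with the point biserial reported for Predicted Grade.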


Table 2 reports for the total sample the correlations among the
six predictors.

The first column of Table 3 reports the b-weights and multiple
validity coefficients for the six variables in predicting the
enrollment criterion. The maximum validity with which enrollment
could be predicted was .462. The second column in Table 3 contains
proportional integral approximations to the b-weights. The
correlation for predicting enrollment was decreased only from .462
to .455 by this rounding procedure. The right-hand column gives the
standardized weights developed for predicting success criteria. The
.43 validity of the conventionally weighted parts of the Geometry
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TABLE 1 
Descriptive Data for Students in Criterion Groups

                     Enrolled        Not Enrolled
                     (N = 478)       (N = 267)
Variables            M      SD       M      SD      r_pbi
Algebra Grade*      2.36   1.08     1.49   0.99     [...]
Science Grade*      2.56   0.88     2.08   0.81     [...]
English Grade*      2.64   0.87     2.00   0.89      .31
Soc. St. Grade*     2.78   0.92     2.27   0.93      .2[...]
Predicted Grade*    2.41   0.83     1.61   0.95      .40
Prog. Test Items   19.64   8.46    14.31   7.58      .30

* A = 4, B = 3, C = 2, D = 1, F = 0.


Prognosis Test is sufficiently close to the .46 optimal validity to
suggest that special weighting for predicting geometry enrollment
is of limited benefit. The most dramatic difference between the
[...] success criteria is the con[...] times as much weight in
predicting enrollment [...] standardized weighting for predi[...]
interesting finding revealed by Ta[...] Geometry Prognosis Test in
predicting ge[...] predicting enrollment. The various weightings of
the test parts produced validities in the mid .40's for the
enrollment criterion while the standard weighting of the test parts
yielded validities in the mid .60's for criteria of geometry success.

Table 4 gives the per cent of students in each conventionally 
weighted Geometry Prognosis Test total score decile that en- 
rolled in geometry the following year. These data facilitate pre- 


dicting geometry enrollment of individual students. 

TABLE 2
Intercorrelation of Variables (N = 745)

Variables               M       SD      2     3     4     5     6
1. Algebra Grade       2.04    1.12    .43   .45   [...]
2. Science Grade       2.39    0.89          .42   [...]
3. English Grade       2.42    0.92                [...]
4. Soc. St. Grade      2.60    0.96                      [...]
5. Predicted Grade     2.12    0.96                            [...]
6. Prog. Test Items   17.73    8.55

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TABLE 3 
Multiple-Regression Findings for Enrollment Criterion

                                   Rounded      Standardized
Variables            b-Weights     b-Weights    Weights
Algebra Grade          .026           4             2
Science Grade          .006           1             2
English Grade          .030           5             2
Soc. St. Grade         .005           1             2
Predicted Grade        .053           8             2
Prog. Test Items       .007           1             1
Enrollment Validity    .462          .455          .43
Grade Validity                                     .63
Test Validity                                      .65

TABLE 4
Expectancy Table

Geom. Prog. Test     Per Cent
Total Score          Taking Geom.
62-80                   89
53-61                   83
47-52                   84
44-46                   85
39-43                   74
36-38                   69
33-35                   56
30-32                   52
26-29                   24

Findings of this study should be interpreted with some caution
in the absence of cross-validation. Moreover, predictions of
geometry enrollment should ideally be based on expectancy studies
from the local school in order to reflect any unusual local
circumstances that may be present.
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THE VALIDITY OF A PRE-STATISTICS SURVEY 
TEST OF BASIC MATHEMATICAL SKILLS IN 
RELATION TO ACHIEVEMENT IN THE FIRST 

COURSE IN PSYCHOLOGICAL STATISTICS


Instructors in beginning courses in psychological statistics
offered by departments of psychology and education have long had
the experience that many students lack competency in basic
arithmetic skills and in simple algebraic operations. In 1964, Robert
Witte urged several professors in the Psychology Department at
San Jose State College to consider developing and using a pretest in
mathematical competencies for students entering elementary
statistics. Accordingly, Edward Minium analyzed the types of
mathematical skills necessary for elementary statistics, defined
course objectives, prepared test items, submitted them to Robert
Clarke for his editorial review, and then constructed a 50-item
four-choice form that was administered by Robert Witte, who also
obtained preliminary validity data.

The first 35 items required basic arithmetic operations of
addition, subtraction, multiplication, division, and square root
involving the use of integers, proper and improper fractions,
decimal points, proportions, percentages, and ratios. Consisting of
15 items, the second part was concerned with elementary algebraic
problems requiring simplification of relatively complex expressions,
solutions of relatively easy linear equations, and substitution of
numerical values for letters in formulas.

During the following spring semesters of 1964 and 1965, the
Pre-Statistics Survey Test of Basic Mathematical Skills (PSTBMS)
was administered to 507 students who were enrolled in
Statistics 115A. Nearly half were majors in psychology. About
half were pursuing studies in the social sciences and education.
There was no high school mathematics requirement either for
admission to the college or for enrollment in the course. Of the 219
students who received a score of 39 points or higher in the 50-item
test, the percentages (rounded to the nearest unit of five) earning
grades of A, B, C, D and F, or other grades (I, Incomplete; W,
Withdrawn; and Dropped) were, respectively, 30, 45, 20, 0, and 5;
of the 174 students who obtained scores of between 30 and 38 points,
the corresponding percentages were 10, 35, 30, 10, and 15; and of
the 114 students who fell below a score of 30, the corresponding
percentages were 5, 20, 40, 20, and 15 (Minium, 1969).

Purpose. The two principal objectives of this investigation were 
(1) to present criterion-related validity data of the PSTBMS for 
two graduate classes of 58 and 85 students who were enrolled in an 
introductory statistics course during the 1969 spring semester and 
1969 summer session at a metropolitan university and for two 
upper division classes of a combined enrollment of 41 students (of 
whom 30 were graduate students) at a large California state college 


and (2) to report briefly item-analysis data obtained from the two
university classes.

Methodology. Whereas for the class of 58 students the time for
completing the examination was unlimited (actually no students
[...]ded time limit of 30 minutes [...] examinees. For the group of 41
[...] the examination. Use of vari[...] the class, feedback was
provided [...]
midterm and final examination were given; in addition, a short
10-item test involving a programmed learning task was also ad- 
ministered as an exercise for extra credit during the final examina- 
tion. For the other two groups data were available for only one 
criterion examination.

Means, standard deviations, and product-moment coefficients of 
correlation were calculated. For the two university classes, item 
analyses were effected through use of a special computer program 
(Jones, Pullias, and Michael, 1965), which also furnished internal 
consistency estimates of reliability. All data were based on stu- 
dents who completed the courses and fulfilled all requirements. 
These students constituted for each sample slightly more than 90 
per cent of those who took the PSTBMS. 
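
The internal consistency estimates mentioned above are reported in the tables as Kuder-Richardson Formula 20 values. The sketch below shows how KR-20 is computed from a matrix of right/wrong item scores. The response data are hypothetical, and using the population variance (dividing by N) for total scores is an assumed convention, not something the source specifies.

```python
def kr20(responses):
    """Kuder-Richardson Formula 20 for dichotomous (0/1) item scores.

    responses: list of examinee rows, each a list of 0/1 item scores.
    """
    n_people = len(responses)
    n_items = len(responses[0])

    # Variance of total scores across examinees (population form).
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_people
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_people

    # Sum over items of p * q, where p is the proportion answering correctly.
    sum_pq = 0.0
    for i in range(n_items):
        p = sum(row[i] for row in responses) / n_people
        sum_pq += p * (1 - p)

    return n_items / (n_items - 1) * (1 - sum_pq / var_total)

# Hypothetical responses of five examinees to four items (NOT study data).
example = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(example), 3))
```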

Findings. The major findings are presented in Tables 1 and 2. 
Although not shown in the tables, item analysis data may be sum- 
marized for the two university groups. For the sample of 58 stu- 
dents that was allowed unlimited time, the difficulty level in terms 
of proportions of individuals answering items correctly (not cor- 
rected for chance success) varied from .98 to .46 with the median 
being .82. The point biserial coefficient of index discrimination in 
terms of placing in the upper half or lower half in total scores 


TABLE 1

Intercorrelations among Pre-Statistics Survey Test and Each of Four Criterion
Tests Along with Means, Standard Deviation, and Number of Items
(N = 58)

                                                                          Number
Variables                            (1)  (2)  (3)  (4)  (5)     M      σ   of items
1. Pre-Statistics Survey Test        88b   39   45   40   47   40.79   6.58    50
2. Criterion Midterm Exam             39   —    65   50   91   70.12   6.55    85
3. Criterion Final Exam               45   65   —    47   90   49.81   6.35    60
4. Criterion Programmed Learning
   Subtest                            40   50   47   —    53    6.52   1.65    10
5. Total Criterion (2) + (3)          47   91   90   53   —   119.93  11.67   145

a Decimal points in correlation coefficients are omitted.
b Reliability estimate based on Kuder-Richardson Formula 20.

1 Appreciation is expressed to the following individuals of the University of
Southern California for their assistance in data processing: Hudhail Al-Amir,
Robert A. Jones, Young Lee, and Calvin Pullias.

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TABLE 2 


Descriptive Statistics for the Pre-Statistics Survey Test and for the Final Objective
Achievement Examination of 85 Items (Criterion Variable)

                               University        State College
                               (N = 85)          (N = 41)
Variables                      M       σ         M       σ
1. Pre-Stat Survey Test a     36.63   9.19      38.61   9.65
2. Criterion Examination      69.79   8.30      62.29   9.99
                              r12 = .58         r12 = .57
                              r11 = .92 b       r11 = —

a Time allowed students at the university and the state college, respectively, 30 and 40
minutes.
b Kuder-Richardson Formula 20 estimate of reliability.


ranged from .02 to .57 with a median of .40; and the phi coefficients
varied from .00 to .48 with a median of .26. Of 50 items, 27
showed phi indices significant at or beyond the .05 level.
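
For a 2 x 2 table the chi-square statistic equals N times phi squared, so a phi coefficient has a minimum size it must reach to be significant at a given N. The sketch below computes phi for one item and the critical phi for N = 58, on the assumption (not stated in the source) that significance was judged by the usual chi-square test with one degree of freedom. The cell counts are hypothetical.

```python
import math

CHI_SQUARE_CRITICAL_05_DF1 = 3.841  # upper 5 per cent point, 1 df

def phi_coefficient(a, b, c, d):
    """Phi for the 2 x 2 table [[a, b], [c, d]].

    Rows: upper half, lower half on total score.
    Columns: item answered correctly, item answered incorrectly.
    """
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denominator

def critical_phi(n, chi_square_critical=CHI_SQUARE_CRITICAL_05_DF1):
    """Smallest absolute phi that reaches significance, via chi2 = n * phi**2."""
    return math.sqrt(chi_square_critical / n)

# Hypothetical item: 26 of 29 upper-half and 18 of 29 lower-half examinees correct.
item_phi = phi_coefficient(26, 3, 18, 11)
threshold = critical_phi(58)
print(round(item_phi, 3), round(threshold, 3))
```

Under that assumption the threshold for N = 58 is close to the reported median phi of .26, which fits the finding that about half of the items (27 of 50) reached significance.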

For the sample of 85 students that was given exactly 30 minutes [...]

(a) There was substantial curtailment of range in scores of the
PSTBMS resulting from about a ten per cent dropout of [...] (all of
whom had low scores on the PSTBMS) [...]

([...]) The item-analysis data pointed to promising internal
consistency, although certain items could profit substantially from
editorial revision.
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THE USE OF THE QUICK NUMBER TEST IN THE 
PREDICTION OF ACADEMIC PERFORMANCE 


EDGAR F. BORGATTA 
University of Wisconsin 


GEORGE W. BOHRNSTEDT 
University of Minnesota 


As described in a previous paper (Corsini and Borgatta, 1968), 
the Quick Number Test (QNT) was designed to assess perfor- 
mance in the numerical area with a brief but efficient test. The 
test had several limitations that were noted, the most serious of 
which occurred for use with highly educated groups. In particular, 
the ceiling of the test is relatively low, and persons who have had 
some exposure to college are likely to obtain high scores on the test. 
This skews the distribution, and only those who lack facility in 
the handling of numbers are distinguishable by the test. However, 
the projected use of the test is for differentiating among those who 
have an adequate level of performance and those who do not. It is 
not intended to be a test to locate the outstanding performers. 

Two kinds of questions arise in the use of the QNT with college 
students. Does the relatively low ceiling affect the test as a screen- 
ing device? Second, how does the QNT compare with numerical 
subtests of standard college entrance examinations? Actually, by 
answering the second question, some perspective is gained on the 
first. Therefore, it was decided to select a sample of students for 
whom other test scores were available and for whom grade point 
averages would also become available. 

Procedure. Cooperation was elicited from professors of the intro- 
ductory sociology course at the University of Wisconsin, and the 

QNT was administered along with other tests as an exercise to
illustrate the relationship of general tests to specific classroom
performance. The Scholastic Aptitude Test (SAT) (Educational Testing
Service, 1967) battery and the American College Testing Program
(ACT) (American College Testing, 1967) were administered
to students at the regularly scheduled times controlled by the test
agencies, while the College Qualification Test (CQT) (Bennett,
Bennett, Wallace and Wesman, 1961) was administered at the
orientation session during the first week on campus.

Table 1 presents the zero-order correlation coefficients between 
all subtest scores. In addition, the zero-order relationships between 
the subtest scores and the high school percentile (HSPC) of the 
student as recorded in the files of the University, and the grade 
point average at the end of the first semester of school are shown. 

Two samples are represented in the data. The University requires 
all students to take the CQT and either the SAT or ACT. The 
smaller sample (“Sample A”) includes those who had taken the 
SAT (N = 342). The larger sample (“Sample B”) includes those 
who opted to take the ACT. Since 74 took both the SAT and 
ACT tests, the samples overlap by that amount.


TABLE 1
Intercorrelations between Test Scores, HSPC, and GPA

Sample A (N = 342)
         QNT    CQTV   CQTN   SATV   SATM   HSPC   GPA
QNT       —     .10    .80    [?]    .52    .22    .35
CQTV             —     .07    .73    .20    .00    .87
CQTN                    —     [?]    .62    [?]    .40
SATV                           —     .46    .06    .38
SATM                                  —     .14    .85
HSPC                                         —     [?]
GPA                                                 —

Sample B (N = 509)
         QNT    CQTV   CQTN   ACTE   ACTN   HSPC   GPA
QNT       —     [?]    .76    .82    .68    .28    .28
CQTV             —     .31    .60    .81    .20    .40
CQTN                    —     .36    .80    .88    .34
ACTE                           —     .37    [?]    .89
ACTN                                  —     .28    .84
HSPC                                         —     [?]
GPA                                                 —

Note.—Values involving GPA or HSPC and a numerical test are italicized.
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Findings. The relationships among the tests. As indicated in 
Table 1, the QNT is substantially correlated with the other tests 
of numerical ability. The QNT is highly related to the CQTN as 
are the other tests, but the CQTN is somewhat more highly cor- 
related with the SATM and the ACTN than is the QNT for these 
samples. 

The relationship of the tests and HSPC. High school percentile
may be used as a criterion measure for abilities tests. In general, it
would be expected that all types of abilities would be positively
correlated, and verbal and numerical abilities would be especially
related to a criterion of academic success. The relationship of the
numerical tests to the high school percentile appears to be similar
for the QNT, the CQTN, and the ACTN, with the CQTN possibly
having a somewhat higher correlation coefficient in Sample B. The
SATM possibly has a slightly lower correlation with the high school
percentile.

The relationship of the tests to the GPA. The GPA after the first 
semester of study represents in large part the results of participa- 
tion in required and/or elementary courses. Generally speaking, 
this involves four or five courses, and again the criterion may be 
judged to be relatively imperfect because of variations of difficulty 
of courses. In theory, some degree of homogeneity of the samples 
should be expected on the grounds that all students registered in an 
elementary sociology course. The relationship between the numeri- 
cal tests and the GPA appears to be in the same range, with the 
CQTN possibly showing some superiority in Sample A and the QNT 
possibly showing some inferiority in Sample B. 

Ordinarily, the task of prediction of grade point average or other
criteria involves the use of multiple correlation with predictors
such as ability tests or prior performance measures (e.g., HSPC).
Thus, a second analysis was carried out that involved the multiple
correlation coefficients of a numerical test, a verbal test, and the
HSPC.

The data for the comparisons may be seen in Table 2 which pre- 
sents the multiple correlation coefficients of each numerical test 
and each verbal test plus the HSPC to GPA. Additionally, the 
squared values of R, the coefficient of determination, are presented. 
They indicate the amount of variance in the GPA criterion that is 
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TABLE 2 
Multiple Correlation Coefficients from Numeric and Verbal Test
plus HSPC to GPA

Sample A
R(GPA·QNT, CQTV, HSPC)  = .53     R² = .28
R(GPA·CQTN, CQTV, HSPC) = .56     R² = .32
R(GPA·SATM, CQTV, HSPC) = .52     R² = .27
R(GPA·QNT, SATV, HSPC)  = .51     R² = .26
R(GPA·CQTN, SATV, HSPC) = .54     R² = .29
R(GPA·SATM, SATV, HSPC) = .48     R² = .23

Sample B
R(GPA·QNT, CQTV, HSPC)  = .49     R² = .24
R(GPA·CQTN, CQTV, HSPC) = .50     R² = .25
R(GPA·ACTN, CQTV, HSPC) = .50     R² = .26
R(GPA·QNT, ACTE, HSPC)  = .48     R² = .2[?]
R(GPA·CQTN, ACTE, HSPC) = .49     R² = .24
R(GPA·ACTN, ACTE, HSPC) = .49     R² = .24


predicted by the three indicators. In general, it is seen that the R
values are in the same range.
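As an illustration of how a multiple correlation of this kind can be obtained from zero-order correlations, the following is a minimal sketch, not the authors' computation. It assumes the standard identity R² = r'·inv(Rxx)·r, where Rxx holds the correlations among the predictors and r the correlations of each predictor with the criterion. The function name `multiple_r` and the numerical values are hypothetical and are not taken from Table 1.

```python
import numpy as np

def multiple_r(r_xx, r_xy):
    """Multiple correlation of a criterion on k predictors.

    r_xx : (k, k) correlation matrix among the predictors
    r_xy : (k,) correlations of each predictor with the criterion
    Returns (R, R squared), using R^2 = r_xy' * inv(r_xx) * r_xy.
    """
    r_xx = np.asarray(r_xx, dtype=float)
    r_xy = np.asarray(r_xy, dtype=float)
    beta = np.linalg.solve(r_xx, r_xy)  # standardized regression weights
    r_squared = float(r_xy @ beta)
    return float(np.sqrt(r_squared)), r_squared

# Hypothetical correlations for a numerical test, a verbal test, and HSPC
# (illustrative values only, not the study's data).
example_r_xx = [[1.00, 0.30, 0.20],
                [0.30, 1.00, 0.25],
                [0.20, 0.25, 1.00]]
example_r_xy = [0.35, 0.40, 0.30]

R, R2 = multiple_r(example_r_xx, example_r_xy)
print(f"R = {R:.2f}, R^2 = {R2:.2f}")
```

Because predictors such as a numerical test, a verbal test, and HSPC are themselves correlated, R² is generally smaller than the sum of the squared zero-order correlations, which is one reason the R values for different test combinations can fall in a narrow range.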

Conclusion. In this report data have been presented to indicate
the relative efficiency of the QNT. In spite of the fact that the
QNT has a low ceiling, i.e., it cannot discriminate well among
high performers, the QNT appears to be an instrument that is able
to discriminate as well as other numerical tests in the prediction
of the criterion of GPA. The previously reported finding that the
QNT predicts other numerical tests in the range of college screening
(Corsini and Borgatta, 1968) is also confirmed in the current
studies. Thus, the use of the QNT for screening purposes even with


college students is justified. More general implications are that the 
development of more efficien’ 
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NORMATIVE AND PARTIAL VALIDITY 
DATA FOR THE KAB-CBS 


HARRY E. ANDERSON, JR. AND W. L. BASHAW
University of Georgia 


YUNGHO KIM 
American Institutes for Research 


DONALD A. LETON 
University of Hawaii 


Kim, Anderson, and Bashaw (1968a) published the developmental
procedures and results related to the Kim-Anderson-Bashaw
Child Behavior Scale (KAB-CBS) for children in the second
grade. The final form of the KAB-CBS consists of 6 items in each
of three factors of maturity; viz., Academic, Interpersonal, and
Emotional (18 items in all). Further studies (e.g., Anderson and
Bashaw, 1968) have shown that the same three factors are obtainable
with four- and five-year olds; moreover, the Interpersonal
and Emotional Factors emerged in analyses of data with three-year
old children but the Academic Factor lacked definition at that
age level.

In the development of the KAB-CBS, two of the item selection 
criteria were ratings by teachers for each item on the following 
characteristics: (1) Could the teacher make a valid judgment about 
the child with regard to the item? (2) Would a response to the
item in any way embarrass the teacher, the child, or his family?
Although the items were originally derived from the developmental 
literature, teachers were not queried with regard to the relevance 
of the item to maturity. A second article by Kim, Anderson, and 
Bashaw (1968b) presented the relationship of the Factor scores 
to ability and achievement measures which may be particularly 
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appropriate as a validation of the Achievement Factor, but may 
have little relevance for the validity of scores on the Interpersonal 
and Emotional Factors. 

Purpose. The purpose of the present study was to provide rat- 
ing information with regard to the relevance of behavior reflected 
in each of the items to maturity in the development of children. 

Moreover, because of the many requests since the publication of 
the KAB-CBS, a second purpose of this article was to furnish 
normative information based on the original Kim-Anderson-Bas- 
haw (1968a) study. 

Method. The subjects for the validity study consisted of 293 
teachers enrolled in summer courses in a College of Education. 
The teachers rated all of the 18 items with regard to characteristics 
first of Desirability, then Importance, then Frequency, and lastly, 
contribution to maturity (Contribution). For instance, with regard 
to the third characteristic, the teachers were told: 


Thinking of maturity in the development of children (e.g., six
to ten years of age), rate each of the following 18 items with regard
to how frequently you think children ought to exhibit behavior
as reflected in the item. Use the seven-point scale with
each item so that ‘1’ represents ‘very infrequently’ to ‘7’, ‘very
often.’


The four characteristics were selected on the basis of discussions
with experts in the field of elementary education. On the 7-point
item scales across all four characteristics, the ‘1’ always represented
a negative type of response while the ‘7’ reflected a positive type of
response.


-O; moreover, the Desirability and Con- 
higher ratings across all 


Across all characteristics except, 
The item rating distributio, 
& more detailed examinati 
"i Percentages of ratings 


1 
With regard to the Desirability characteristic, all six items in the
Academic Factor have mean ratings above 6.00, while three of six
item-mean ratings in each of the remaining two factors are equally
high. For items 9, 12, 14 and 16, 16 to 17 per cent of the ratings
were at a scale value of 4 or below. Percentages of ratings at the
scale value of four or below for all other items ranged from one to
41. In general, then, the items reflect desirable behavior in terms of
maturity in the development of children.
In the Importance characteristic, again all six items in the
Academic Factor have mean ratings above 6.0, but only one item in
each of the other two Factors has mean ratings that high. The percentages
of ratings at or below the scale value of four for items
7, 9, 10, and 12 ranged from 20 to 27, while similar rating percentages
for all other items ranged from two to 15. Although Desirability
ratings were higher than Importance ratings, the teachers
judged the items as tending to reflect behavior that is important
with regard to maturity in the development of children.

The mean ratings for items with respect to the Frequency characteristic
are generally lower so that only two items are above 6.0,
but the lowest mean rating is 5.11, which is in the favorable direction.
Again as with Importance, items 7, 9, 10, and 12 were not so
strongly endorsed as the others; percentages of ratings at and
below value 5 for items other than 7, 9, 10, and 12 ranged from
five to 13.
The item ratings as regards Contribution are high so that all
Academic Factor item rating means are above 6.0; two of the six
Interpersonal and four of the six Emotional item rating means are
likewise above 6.0. As for the percentages of responses at or below
the scale value of 4, only item 9 had the relatively large percentage
of 23, although 15 per cent of the ratings were also at these scale
values for items 7, 10 and 12; similar percentages for all other
items ranged between four and 11. There appears to be strong
agreement that the behavior represented by the items tends to contribute
to maturity in the development of children.

Percentile norms for a white sample of second graders are presented
in Table 2 for each of the subscales of the KAB-CBS. A
7-point scale was used with each item so that the range of possible
scores is from six to 42, and the percentile norms are presented
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TABLE 1 


KAB-CBS Item Means and Standard Deviations (SD) from 293 Teacher Ratings, Using a
Seven-Point Scale, with Respect to Four Characteristics Regarding Maturity

                                                     Characteristics
                                   Desirability   Importance    Frequency    Contribution
KAB-CBS Items                      Mean    SD     Mean    SD    Mean    SD    Mean    SD
 1. The student can work alone
    for a period of time.          6.41   0.93    6.21   0.99   5.58   1.17   6.33   0.9[?]
 2. The student returns to a
    task unfinished from the
    previous day and develops it.  6.27   1.02    6.09   0.97   5.69   1.02   6.23   1.0[?]
 3. The student carries
    activities to completion.      6.44   0.92    6.28   0.84   6.05   1.03   6.41   0.84
 4. The student carries out
    brief individual assignments
    in school without
    supervision.                   6.41   0.89    6.11   0.91   5.76   1.12   6.31   0.9[?]
 5. The student reads on
    his/her own initiative.        6.41   1.00    6.28   0.97   5.95   1.16   6.37   0.91
 6. The student enjoys books,
    newspapers, and/or
    magazines.                     6.40   0.90    6.10   0.96   5.81   1.16   6.16   1.08
 7. The student enjoys team
    games and group games.         6.01   1.03    5.47   1.12   5.38   1.08   5.58   1.16
 8. The student makes friends
    quickly and easily.            6.09   0.94    5.87   0.94   5.82   0.94   6.04   0.9[?]
 9. The student takes part
    in competitive games.          5.01   1.16    5.15   1.22   5.22   [?]    [?]    [?]
10. The student takes initiative
    at play or in the
    classroom.                     5.77   1.11    5.7[?] 1.12   5.58   1.10   5.01   1.09
11. The student is friendly
    toward other people.           6.18   0.90    6.00   0.95   [?]    [?]    [?]    [?]
12. The student assumes
    group leadership for a
    given activity.                5.00   1.11    [?]    [?]    [?]    [?]    [?]    [?]


Mean SD Mean SD Mean SD 


6.0 0.95 6.07 0.92 

526 1314 5.11 1.11 5.63 1.10 

5.96 100 5.99 1. „oi 

14. Tho student net 1.05 5.77 1.16 5.97 1.01 
erly to the teacher’s 


approval or disapproval, 5.70 1.16 5.60 1.17 5.59 1.12 5.83 1.07 


9.12 1.02 5.95 1.04 5.88 1.15 6.17 0.94 
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5.70 1.30 5.64 1.19 5.64 1.23 6.07 1.12 


6.15 1.00 5.96 1.05 5.78 1.17 6.24 0.90 


6.00 1.05 5.93 1.00 


6.26 0.98 


separately for males and females. It may be of interest that females,
generally, are rated as being more mature than males.

Discussion. The results of the present study lend fairly strong
support that the KAB-CBS items reflect behavior that is appropriately
related to maturity in the development of young children.
As indicated by the teachers’ ratings of the items in relation
to four characteristics (viz., Desirability, Importance, Frequency,
and Contribution to Maturity), the lowest mean rating on any item
on a seven-point scale was 5.11 and many rating means were
above 6.00. Indeed, in 82 per cent of the item ratings, at least 85
per cent of the teachers rated the item at least five points, in a positive
direction, on a 7-point scale.

The items numbered 7, 9, 10, and 12 would seem to provide the
least confidence, relatively speaking, in terms of relevance to
maturity. All of these items are in the Interpersonal Factor so that
this factor may prove to be the most troublesome in future experimental
work. From 73 to 85 per cent of the teachers, however,
even with the above items, rated each item five points or more.

The original Kim-Anderson-Bashaw (1968a) study provided factorial
validity for the items and their subsequent study (1968b)
provided concurrent validity, at least for the Academic Factor. Some
studies similar to the latter one, or perhaps experimental studies
with children, might well be undertaken to provide further validation
of the subscales. Since the KAB-CBS was constructed for
teacher use, however, it would appear that teacher-judgment studies,
such as the present one, are fundamental.

Finally, we should point out that the characteristics selected for
use in the present study are certainly not independent. Individual
item correlations across the four characteristics range from .20 to
[illegible], with most in the .40’s and .50’s. No studies could be found to
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TABLE 2 


Percentile Norms for White Males and Females on Sub-scales 
of the KAB-CBS 


                         Percentile Norms

                  Males                            Females
Raw               Inter-                           Inter-
Score  Academic  personal  Emotional   Academic  personal  Emotional
 42      99        99        99          99        99        [?]
 41      88        92        88          73        83        78
 40      85        89        86          67        81        73
 39      82        85        81          60        76        69
 38      76        79        75          57        70        62
 37      78        78        72          52        67        57
 36      71        75        68          50        64        50
 35      69        69        64          48        59        44
 34      65        67        60          44        55        40
 33      61        65        54          40        51        37
 32      59        59        50          38        48        32
 31      57        53        45          36        43        26
 30      55        50        43          33        39        23
 29      52        45        38          29        32        19
 28      49        40        35          27        29        17
 27      47        37        31          25        27        15
 26      43        30        27          23        24        14
 25      39        26        23          20        23        13
 24      36        23        21          18        21        12
 23      33        18        18          15        19        10
 22      28        16        14          14        16         9.0
 21      25        15        12          13        15         7.4
 20      22        12         9          12        13         7
 19      20        10         6.4        11        11         5.2
 18      18         6.8       5.2        10         9.3       5
 17      17         6.0       4.4         8.2       7.1       4.5
 16      14         5.2       3.6         6.3       6.3       4
 15      11         3.6       3.6         5.2       5.2       3.3
 14      [?]        2.8       3.2         4.5       4.8       3
 13      10         2.4       2.0         4.1       4.5       2.6
 12      10         2.4       1.6         3.4       3.3       1.5
 11       8         2.0       1.2         2.9       2.2        .7
 10       6         1.6       1.2         1.6       1.5       [?]
  9       5.2        .8       1.2         1.5       1.5       0
  8       3.2        .4        .8         1.5       1.1       0
  7       2.4        .4        .8         1.5       [?]       0
  6        .8        .4        .4         [?]       [?]       0
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THE VAN ALSTYNE PICTURE VOCABULARY 
TEST USED WITH SIX-YEAR-OLD 
MEXICAN-AMERICAN CHILDREN! 


ROBERT A. KARABINUS² AND MAURE HURT, JR.
University of Arizona 


THE Van Alstyne Picture Vocabulary Test was revised in 1961
with a norming population of about 400 children (nearly 100
six-year-olds) from Connecticut, Florida, and school groups from
New York City. Only limited information concerning these children
was available from the references (Van Alstyne, 1961; Bligh,
1959); i.e., the mean IQ scores for the six-year-olds ranged from
100.3 (Lorge-Thorndike) to 108.0 (Columbia Mental Maturity
Scale). Therefore, it must be assumed that these children approximated
typical average Americans.

Purpose. The purpose of this study was to describe the results 
of the revised Van Alstyne Picture Vocabulary Test given to two 
large groups of six-year-old Mexican-American school children 
in Tucson, Arizona. The two groups of subjects were attending 
poverty qualifying schools in Tucson School District No. 1. They 
were tested in 1965 and 1966, respectively, as part of a larger
study in Early Childhood Education conducted jointly by Tucson 
School District No. 1 and the University of Arizona. The Van 
Alstyne test was chosen to measure the intelligence of these 
culturally disadvantaged children primarily because it discrim- 
inated well among the children along the low end of the scale, 
and it was also easy and quick to administer (15 minutes per 


m the University of Arizona and Tucson 
tive Research Project, 1965-1968. It was 
contract (OEC-3-7-070064-2866) with the U. S. Office 
of Education, Department of Health, Education, and Welfare. 

2 Presently at Northern Illinois University. 
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child). Even though the behaviors measured on this test have been 
questioned as representing mental ability (Buros, 1965), other researchers
have found the test (first edition) valuable in measuring
general ability of minority and handicapped groups (Moore, 1942;
Dunn & Harley, 1959; Schneiderman, 1955). The basic assumption 
in this test is that picture-pointing behavior demonstrates vo- 
cabulary comprehension from which is inferred mental ability 
and intelligence. The revised instrument has 60 cards, each with 
four pictures; the child responds by pointing to the picture that 
best illustrates the word or phrase given orally by the examiner 
for each card. 

Reliability. The first group of 328 Mexican-American, Indian,
and other Spanish-speaking children was tested in the fall of
1965. The second group of 207 children was tested in the fall of
1966. The ratio of boys to girls in both groups was about 52-48.
Three types of reliabilities were calculated using raw scores:
Spearman-Brown, Kuder-Richardson, and Test-Retest, with coefficients
ranging from .76 to .87. (See Table 1.) A variety of
random factors accounted for the drop of N in calculating the
coefficient of stability for each group.
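The reliability figures named above rest on two standard formulas, sketched here under stated assumptions rather than as the authors' procedure. The Spearman-Brown formula steps a split-half correlation up to a full-length reliability, and the standard error of measurement is the score standard deviation times the square root of one minus the reliability. The function names and the split-half value of .75 are hypothetical; the standard deviation of 7.9 and the coefficient of .861 are the values readable in the 1966 row of Table 1.

```python
import math

def spearman_brown(r_half):
    """Full-length reliability estimated from a split-half correlation."""
    return 2 * r_half / (1 + r_half)

def standard_error_of_measurement(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical split-half correlation (not reported in the article).
print(round(spearman_brown(0.75), 3))

# SD and Spearman-Brown coefficient as read from the 1966 row of Table 1.
print(round(standard_error_of_measurement(7.9, 0.861), 2))
```

The second result is close to the 2.9 printed at the end of the 1966 row, which suggests, though the column heading is not legible, that the final column of Table 1 reports the standard error of measurement.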

Comparison of these results with data in the test manual based
on 93 six-year-old children sele[...]
showed that all reliability [...]
the Mexican-American [...]

TABLE 1
Reliability Data for the Van Alstyne Picture Vocabulary Test

                                Reliabilities
                      Spearman-   Kuder-        Test-
Group            N    Brown       Richardson    Retest       Range    X̄      S     SEm
Mex.-Am.
  I: 1965       328     —         .760(a)       [?]           2-58   34.0   7.6   2.[?]
                                                (N = 189)
  II: 1966      207   .861        .830(b)       .763          8-51   33.5   7.9   2.9
                                                (N = [?])
General
Population
  1959           93   .71           —           [?]          20-58   44.8   5.2   2.8

a K-R 21
b K-R 20
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Test can give reasonably accurate measures for six-year-old Mex- 
ican-American children. 

Validity. Concurrent validity data were found to be above .60 
with the Stanford-Binet Intelligence Scale, Form L-M (Vocabu- 
lary), 1960, the Wechsler Intelligence Scale for Children, (Vo- 
cabulary), 1949, and the Metropolitan Readiness Tests, Form A, 
1964, given during the fall of 1966 at about the same time that the 
Van Alstyne Picture Vocabulary Test was given to the second 
group of children. Raw scores on the Van Alstyne test were 
correlated with scaled or raw scores on these three instruments. 
(See Table 2.) 

Comparison of these results with data in the test manual is not 
really possible because the Stanford-Binet Intelligence Scale was 
not given to the six-year-old children in the general population. 
Correlations of .71 and .60 were reported for four- and five-year- 
olds respectively for the complete Stanford-Binet Intelligence 
Scale. Since only the vocabulary section was used with the Mexican-American
six-year-olds, the comparative meaning of a correlation
of .72 is not known. Also, data from the Wechsler Intelligence
Scale for Children and the Metropolitan Readiness Tests
on the general population of six-year-olds are not available.

Score Distribution and Norms. Raw scores for both groups of


TABLE 2
Concurrent Validity Data for the Van Alstyne Picture Vocabulary Test

                                                  Other Measure
Test                               N       r       X̄       S
Stanford-Binet (Vocabulary),
  derived [?]                     199    .728    48.5     9.2
WISC (Vocabulary),
  scaled score                    198    .664     8.7     3.8
Metropolitan Readiness            [?]
  1. Word Meaning                        .41[?]   5.6     2.0
  2. Listening                           .434     8.1     2.2
  3. Matching                            .408     5.9     2.9
  4. Alphabet                            .366     5.8     3.8
  5. Numbers                             .544     8.6     3.8
  6. Copying                             .239     5.4     2.8
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six-year-olds were combined in a frequency distribution using 
normalized frequencies. Raw scores ranged from 2 to 58. (See
Table 3.)

Comparison of this distribution with the percentile rank table
in the test manual again is not directly possible because the
manual gives only percentiles for IQ scores. However, the manual
does include a chart showing mental age for each raw score.
For six-year-old children (mental age) the range of raw scores 
given for the general population is 44 to 47. The mean for the 
Mexican-American group used in this study was computed from 
the grouped data to be 33.4. Therefore, this normalized dis- 
tribution of scores based on the data acquired from 535 six-year- 
old Mexican-American children might be more useful than the 
manual’s tables when one is trying to compare results of other 
Mexican-American or other culturally disadvantaged children. 

Summary. The Van Alstyne Picture Vocabulary Test (rev.)
was found to be both reliable and valid for the measure of mental
ability of a large group of culturally disadvantaged Mexican-


TABLE 3
Normalized Distribution of Van Alstyne Scores for Six-Year-Olds

Raw Score     Normalized
Interval      Frequency     Percentile
57-59            2            99.9
54-56            4.5          99
51-53           [?]           98
48-50           16.5          95.7
45-47           28            91.5
42-44           41            85
39-41           54.5          76
36-38           63.5          65
33-35           69.5          52.7
30-32           67            39.8
27-29           59            28
24-26           46.5          18.2
21-23           32            10.8
18-20           20             7.8
15-17           11.5           3.0
12-14            6            [?]
 9-11            2.5          [?]
 6-8             1              .3
 3-5              .5            .2
 0-2            [?]           [?]
American six-year-old children living in Tucson, Arizona. A normalized
frequency distribution of the raw scores showing corresponding
percentile ranks was developed for the purpose of
offering another set of norms that might be useful when measuring
other culturally disadvantaged children.
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AN EVALUATION OF THE SCREENING TEST 
OF ACADEMIC READINESS 


JON MAGOON 
University of Delaware 
AND 
RICHARD C. COX 
University of Pittsburgh 


THE Screening Test of Academic Readiness (STAR) (Ahr,
1966) is a group test designed to discriminate between those 
preschoolers who would be acceptable for early admission to 
formal schooling, and those who would not. The instrument is 
intended to aid the school psychologist in identifying a pre- 
schooler’s learning characteristics, social and emotional difficulties, 
and developmental and remedial needs. The purpose of this note 
is to report a wide variety of test and item characteristics for the 
STAR, gathered from a large heterogeneous national sample, and 
to evaluate the utility of this instrument in light of these data. 

The STAR instrument is composed of fifty items, divided into
subparts denoted as picture-vocabulary (11 items), letters (5 items),
copying (3 items), picture description (7 items), human figure
drawing (1 item), relationships (7 items), and numbers (11 items).
Subscores are derived for each of these areas, as well as a total
score. Subjects’ responses are usually recorded as crosses or X's
on alternate choices for items in the test booklet. With the exception
of item number 32, all items were dichotomously scored
right or wrong for these analyses. Item 32 represents a human
figure drawing task that was rated from zero to 12 points.

The STAR was administered to a nationwide sample of approximately
4000 first graders and kindergarteners. Other measures
including the California Test of Mental Maturity were ad-
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ministered concurrently. The sample subjects were dispersed geo- 
graphically across the United States, derived from both urban 
and exurban environments, and represented widely differing cul- 
tural and racial backgrounds. The total test mean and standard 
deviation (item 32 inclusive), item means and standard deviations, 
and Kuder-Richardson reliability are indicated in Table 1. 


TABLE 1

Screening Test of Academic Readiness (STAR) Summary
Statistics (50 Items)*

Item    Mean     SD     Type
  1     .89     .32     Picture Vocabulary
  2     .82     .39
  3     .90     .30
  4     .57     .50
  5     .56     .50
  6     .73     .44
  7     .76     .42
  8     .71     .45
  9     .40     .49
 10     .97     .48
 11     .25     .43
 12     .48     .50     Letters
 13     .39     .49
 14     .49     .50
 15     .64     .48
 16     .14     .35
 17     .78     [?]     Picture Completion
 18     [?]     [?]
 19     .07     .47
 20     .63     .48
 21     .62     .49
 22     .64     .48
 23     .47     .50
 24     .59     .49
 25     [?]     [?]     Picture Description
 26     [?]     [?]
 27     .24     .43
 28     .54     .50
 29     .88     .32
 30     .83     .88
 31     .92     .27
 32    7.58    3.15     Human Figure Drawing
 33     .91     [?]     Relationships
 34     [?]     [?]
 35     .04     [?]
 36     .83     [?]
 37     .76     .43
 38     .36     .48
 39     [?]     .42
 40     .51     .50     Numbers
 41     .66     .47
 42     .46     .50
 43     .76     .43
 44     .35     .48
 45     .63     .48
 46     .39     .49
 47     .45     .50
 48     .82     .47
 49     .18     .89
 50     [?]     [?]

* N = 3947.

Total Test Mean = 35.11.
Total Test Standard Deviation = 11.9.
Total Test Reliability (Kuder-Richardson #20) = .877.
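For readers unfamiliar with the reliability index reported under Table 1, the following is a minimal sketch of Kuder-Richardson Formula 20 for dichotomously scored items. It is an illustration under stated assumptions, not the authors' program: the function name `kr20` and the tiny response matrix are hypothetical, and the population (divide-by-n) variance is assumed for the total scores. Item 32, rated from zero to 12, would need separate handling, since the p(1 - p) term applies only to right-wrong items.

```python
def kr20(responses):
    """Kuder-Richardson Formula 20 for 0/1 item scores.

    responses: list of examinee rows, each a list of 0/1 item scores.
    KR-20 = k/(k-1) * (1 - sum(p*q) / variance of total scores).
    """
    n = len(responses)
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    # Population variance of total scores (an assumption of this sketch).
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    pq_sum = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n  # proportion passing item j
        pq_sum += p * (1 - p)
    return (k / (k - 1)) * (1 - pq_sum / var_total)

# Hypothetical responses of four examinees to three items.
example_responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(kr20(example_responses))
```

The item means in Table 1 play the role of p in this formula, which is why item difficulty and total-score variance together determine the reported coefficient.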


Most of the items are of middle difficulty, but twelve items are
so easy that more than 75 per cent of the examinees passed the
item, while there are only five items where more than 75 per cent
fail the item. No one type of item (e.g., picture vocabulary, or
letters) seems to be consistently very difficult or very easy.

Following the suggestion of Ahr (1966) in the STAR Examiners
Manual, the items (with the exception of item 32) were
analyzed with respect to certain grouping criteria such as sex,
grade placement, socio-economic background, and social background.
For each sample subject the following data were obtained:
(a) sex (male-female); (b) grade level (kindergarten-first grade);
(c) Head Start participation (yes-no); (d) socio-economic class
(low-middle, based on the occupation and income of the parent
or guardian); (e) race (White, Negro, non-White, non-Negro).
A sample of subjects was selected for an analysis of item
discrimination based on measured intelligence. The top and bottom
16 per cent (subjects scoring more than one standard deviation
away from the mean) on the California Test of Mental Maturity
were identified.
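The correspondence between "one standard deviation from the mean" and "16 per cent" follows from the normal curve, as the short sketch below shows. It assumes that the intelligence scores are approximately normally distributed, which the article does not state explicitly; the function name `upper_tail` is hypothetical.

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Share of a normal population scoring more than one SD above the mean;
# by symmetry the same share falls more than one SD below it.
print(f"{100 * upper_tail(1.0):.1f} per cent")
```

Under that assumption each extreme group holds roughly 16 per cent of the sample, leaving about 68 per cent in the middle range that is excluded from the discrimination analysis.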

Table 2 shows a categorical breakdown of percentages of subjects
passing respective items; e.g., when the total population is
split into two sex groups, it is noted that approximately 88 per
cent of the males passed item 1, but roughly 90 per cent of the
female subjects passed the same item.
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TABLE 2 
Proportions Responding Correctly to STAR Items 


IQ 
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Class 


Headstart 


Grade 
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TABLE 2 (Continued) 
Proportions Responding Correctly to STAR Items 
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Table 2 is useful in identifying meaningful differences between 
groups being measured by the STAR. Chi-square analyses were 
computed for each breakdown of pupils for each item. Actually 
such an analysis turned out to be somewhat superficial, for many 
essentially meaningless results are statistically significant, because of 
the large numbers involved. There were notable trends, however, 
that were clearly discernible. The most different groups in terms 
of their responses to the STAR were the high and low IQ groups. 
The chi-square results were significant at the .01 level for all
items in favor of the high IQ pupils. The test as a whole appeared
to discriminate well along intelligence lines insofar as these were
described by the California Test of Mental Maturity. The best
discriminators of intelligence groups were generally items defining
“letters” and “numbers” tasks.
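The remark that small differences become statistically significant with large numbers can be illustrated with the 2 x 2 chi-square statistic. The sketch below is an illustration only: the function name `chi_square_2x2` and the group sizes are hypothetical, and only the pass rates of 88 and 90 per cent are taken from the text. The familiar critical value for one degree of freedom at the .05 level is 3.84.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for the table

                 pass  fail
        group 1    a     b
        group 2    c     d
    """
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Same pass rates (88 and 90 per cent) with two hypothetical group sizes.
small = chi_square_2x2(88, 12, 90, 10)        # 100 subjects per group
large = chi_square_2x2(1760, 240, 1800, 200)  # 2000 subjects per group
print(round(small, 2), round(large, 2))
```

With the proportions held fixed, the statistic grows in direct proportion to the sample size, so a two-point difference that is far from significant with 100 subjects per group exceeds the .05 critical value with 2000 per group.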

Another breakdown that yielded highly significant (statistically)
results was the kindergarten-first grade grouping. Though
the differences were not so pronounced as those of the IQ groups,
they were all significant at the .01 level favoring, of course, the
first graders. The most discriminating items were again associated
with “letters” and “numbers” tasks.

Racial categories 
tically all items. In 


[...] discriminate most among these groups
represent special “picture vocabulary,” “picture description,” and
“relationships” tasks. Such results appear to be consonant with
a view of very important differences inherent in various cultural-
racial backgrounds.

categories were gi 
24 of the items in Table 
associated with mental 

This result was, of course, highly related to, if not redundant with, 
racial, cultural, and other categories, with the exception of sex. 

The structure of the STAR from a technical point of view is 
best discussed with reference to the product-moment item inter- 
correlation matrix. The correlation coefficients were generally low 
to moderate in size. Four coefficients were larger than .50, the 
largest being .60; 120 coefficients ranged between .30 and .50; the 
remainder were low positive. Correlations between similar types of 
items were highest as a rule, but “picture vocabulary” and “picture 
description” items appeared to be negligibly interrelated. 
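
For items scored pass/fail, the product-moment coefficient reduces to the phi coefficient. The following is a minimal sketch of an item intercorrelation matrix under that assumption; the response vectors are hypothetical and the function names are introduced here only for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation; for 0/1 item scores this equals phi."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def intercorrelation_matrix(items):
    """Square matrix of correlations among a list of item score vectors."""
    return [[pearson_r(a, b) for b in items] for a in items]

# Hypothetical pass (1) / fail (0) responses of eight pupils to three items.
item_a = [1, 1, 1, 0, 1, 0, 1, 1]
item_b = [1, 1, 0, 0, 1, 0, 1, 1]
item_c = [1, 0, 1, 1, 0, 1, 1, 0]

matrix = intercorrelation_matrix([item_a, item_b, item_c])
for row in matrix:
    print([round(r, 2) for r in row])
```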

The moderate to high degree of independence between most 
items, as indicated by the low correlation coefficients, was a reflec- 
tion of a general inconsistency in responses across items. The 
source of the inconsistency was not clearly indicated, but might be 
attributable to two different problems. The first might be described 
as the testing immaturity of the sample, whose members were very 
young and inexperienced with test booklets, pencils, test protocol, 
and memory for directions. These very young examinees were un- 
doubtedly easily distracted, and many probably saw little to be 
gained by concentrated attention to the tasks at hand. The 
responses so obtained from such a test-naive sample were in large 
part contingent upon irrelevant mediators having little to do with 
the test items themselves. 

A second source of general examinee inconsistency might be the 
invalidity of the items themselves. From a perusal of the item 
intercorrelation matrix there is generally only a slight suggestion 
that items measured a specific type of skill which could be indicated 
by a composite of item scores. With the notable exception of 
numeration items, the items appeared to function as a set of nearly 
isolated tasks. A principal components factor analysis of the item 
intercorrelation matrix best illustrates this point. The loadings 
larger than .30 on the first six factors in the varimax rotated 
principal components solution are displayed in Table 3. In the 
case of an unrotated factor solution, nine factors actually had vari- 
ances (λ-values) greater than unity and accounted for a total of 41 
per cent of the item set variance. The fact that 59 per cent of the 
STAR item variance remained unexplained by common factors 
(i.e., factors with eigenvalues greater than unity) reflects in an- 
other more specific way the relative independence of the STAR 
items. Returning to the rotated pattern in Table 3, it is seen that 
factor one might be best interpreted as an elementary numerical 
and verbal skills dimension dealing with simple lettering, spelling, 
and numeration skills. The only item aside from these to correlate 
significantly with the first factor was the draw-a-man task, item 
number 32. The first factor, representing 12 per cent of the 
variance of the set, is a measure of the special numbering, draw- 
ing, and lettering skills ordinarily taught in the first grade or 
kindergarten. 
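
The eigenvalues-greater-than-unity rule and the percentage of variance attributed to the retained components can be sketched as follows. The eigenvalues shown are hypothetical values for a ten-item set, chosen only so that they sum to the number of items; they are not the STAR eigenvalues, which are not reported here. The name `kaiser_summary` is an assumption.

```python
def kaiser_summary(eigenvalues):
    """Count components with eigenvalue > 1 and the per cent of variance they carry.

    For a correlation matrix the total variance equals the number of
    variables, i.e. the number of eigenvalues.
    """
    n_items = len(eigenvalues)
    retained = [ev for ev in eigenvalues if ev > 1.0]
    per_cent = 100.0 * sum(retained) / n_items
    return len(retained), per_cent

# Hypothetical eigenvalues for a 10-item correlation matrix (sum = 10).
hypothetical_eigenvalues = [2.6, 1.4, 1.1, 0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.45]

n_retained, per_cent_explained = kaiser_summary(hypothetical_eigenvalues)
print(n_retained, round(per_cent_explained, 1), round(100 - per_cent_explained, 1))
```

By the same arithmetic, nine retained components carrying 41 per cent of the variance leave 59 per cent outside the retained set, as stated in the text.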


TABLE 3 


Varimax Rotated Principal Components Structures 
(Columns are factors, items are rows) 


Communality 


1 2 3 4 


.251 
mos .298 


“310 ns 
+355 
E .236 Picture Vocabular] 
.339  —.372 295 
+815 . i 
—.477 1960 
—.518 : 


320 
31:090 .181 


.560 s .445 
-517 436 Letters 
456, «407 
-390 
472 .315 
-492 -301 | 
+617 .492 Picture Completion 
-584 .467 
-609 .516 


.823 4981 .382 


«551 463 ^ Copyi 
567 .448 hand 


—.39 +233 
ee «240 
—.406 «305 


- 550 +269 Picture Descriptio 
424 +345, 


.558 +245 
.378 
n .978 .971 Human Figure 
Drawing 
.512 .390 
— .335 974 
331 327 Relationships 
210 
i —.712 635 
à —.907 —.326 349 
n —.704 611 
D 611 431 
NES 1424 
B 486 .314 1435 
8 409 —.339 n 
m 679 .532 Numbers 
5 85 439 
6 709 1578 
4T 684 “566 
/j — .500 .325 —— .418 
B — 310 — 585 
i —.801 A47 — Number Puzzle 


2 3 
3.2 2.8 
a Loadings less than .30 have not been recorded. 


The second rotated factor, representing only six per cent of the total item 
set variance, appeared to provide a rough indication of an ability 
to identify common objects or simple concepts, when the content 
of the items is analyzed in relation to the factor loadings. The 
third factor accounted for another six per cent of the set variance 
and suggested an ability for identifying rarer objects and concepts. 

The fourth rotated factor, accounting for four per cent of the 
variance, was essentially a measure of two items which require 
nearly identical responses in drawing a shortest path between 
two points on a simple map. Factor five represented a “picture 
completion” dimension, and absorbed five per cent of the 
variance. Factor six, likewise, appeared to reflect a dimen- 
sion originally designed to be measured independently, i.e., “copy- 
ing” of figures, accounting for four per cent of the item set 
variance. As indicated by the communality of the items (item 
variance accounted for by general common factors) most items 
had the larger portion of their variance explained by item-specific 
factors. 
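
Two pieces of arithmetic underlie this summary and may be sketched briefly. The factor percentages are those reported in the text for the six rotated factors; the loadings are hypothetical (the entries of Table 3 are not reproduced here), and the communality formula assumes orthogonal factors, as in a varimax solution.

```python
def communality(loadings):
    """Sum of squared loadings of one item on orthogonal factors."""
    return sum(value * value for value in loadings)

# Per cent of item-set variance reported for rotated factors one to six.
reported_shares = [12, 6, 6, 4, 5, 4]
total_share = sum(reported_shares)

# Hypothetical loadings of a single item on six factors (illustrative only).
hypothetical_loadings = [0.56, 0.10, 0.05, 0.30, 0.08, 0.12]
h2 = communality(hypothetical_loadings)
unique_share = 1.0 - h2  # variance not carried by the common factors

print(total_share, round(h2, 4), round(unique_share, 4))
```

An item whose communality falls below .50, as in this hypothetical case, has the larger part of its variance outside the common factors.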

It must be concluded from these analyses that the STAR pro- 
vided but a very tenuous measure of lettering, numeration and 
drawing skills, object and concept identification skills, and picture 
completion and copying skills. These data seemed to indicate that 
this “readiness” test measures best those simple skills that are 
taught through formal systematic instruction. With the exception 
of numeration and lettering, it is questionable whether there is 
sufficient construct or factorial validity to scores derived from STAR 
subtests, for the factor analysis has revealed what is judged to be a 
structure dimensionality unrelated to scholastic achievement. 

Consideration of the STAR as a preschool screening instru- 
ment reveals that its factorial structure does not fulfill many of the criteria 
for a school readiness measure as defined by developmental experts. 
Ilg and Ames (1965) found that readiness tests that are highly 
related to IQ, as is the case with the STAR, are inadequate mea- 
sures of a child’s developmental level. From these analyses it 
does not appear that the STAR would be very useful in, as Ahr 
(1966) suggested, identifying “learning problems or social and 
emotional difficulties” that are indicators of developmental level. 
The STAR appeared to be measuring only lettering and numeration 
skills with any great degree of factorial validity, and this circum- 
stance would probably not qualify the instrument as a readiness 
measure in developmental terms. Finally, since the STAR scores 
were found to be highly related to IQ scores, it is likely that there 
would be little difference between the screening potential of the 
STAR and an IQ measure, as 

RELATIONSHIPS AMONG STUDY HABITS AND 
ATTITUDES, APTITUDE AND 
GRADE 8 ACHIEVEMENT 


S. B. KHAN AND DENNIS M. ROBERTS 
The Ontario Institute for Studies in Education 


Brown and Holtzman (1953, 1955, 1956) developed and vali- 
dated the Survey of Study Habits and Attitudes—SSHA for pre- 
dicting first year college achievement. Because of the general 
[...]eness of SSHA, Holtzman and Brown (1968) reported on 
the development of SSHA—Form H for use with students in 
Grades 7 through 12. The validities predicting overall grade 
[...] for a large sample of students ranged from .46 in Grade 
[...] .55 in Grade 7. Additional data relevant to the validity of 
SSHA at Grade 7 were presented by McGuire, Hindsman, King, 
and Jennings (1961) using an experimental version. The same 
experimental version was utilized by Khan (1969) for predicting 
standardized achievement in Grade 9. The multiple correlations 
between scores on eight SSHA factors and six achievement cri- 
teria ranged from .48 to .69. 

The purpose of the present study was to investigate and com- 
pare the validity of a modified version of SSHA for predicting 
standardized achievement test scores as well as teachers’ marks at 
the Grade 8 level. For general comparison purposes, relationships 
between a standardized aptitude test and each of the two achieve- 
ment criteria were also examined. 

Method—Sample. Eight classes of eighth grade students 
from four schools in Peterborough, Ontario, were used. The 
[...] sample consisted of 280 students. Analyses, however, were 
[...] on N = 240 because of pretest-posttest attrition. Both 
males and females were included in approximately equal numbers. 

Predictors and criteria. The predictors were a modified version 
of the SSHA and the Canadian Academic Aptitude Test—CAAT. 
The attitude survey, which was an 80 item revision based on 
Khan’s (1969) findings, purportedly measured attitudes towards 
teachers, attitudes towards education, academic motivation, need 
achievement, achievement anxiety, and study habits. The CAAT 
is a standardized general aptitude instrument appropriate to the 
Grade 8 level with verbal (V) and mathematical (M) sections. 

The criteria were (a) the overall average of final year marks as 
assigned by the students’ teachers and (b) scores on three stand- 
ardized achievement tests. The tests were taken from the Do- 
minion Group Achievement Tests—DGAT battery and included 
the Vocabulary (DGAT-V), the Arithmetic Computation 
(DGAT-AC), and Spelling (DGAT-S) tests. 

Procedure and scoring. The SSHA and CAAT instruments 
were administered early in December 1967. The DGAT tests 
were given in June 1968 and teachers’ marks obtained shortly 
thereafter. The CAAT and DGAT tests were scored according 
to keys supplied with each. No correction formula was used. 
Each statement in SSHA was responded to on a 5-point scale 
ranging from “Strongly Agree” to “Strongly Disagree.” The re- 
sponse data were factor analyzed by using iterative procedures 
for the estimation of communality for each item. Seven factors 
were extracted and factor scores were used to predict Grade 8 
achievement. 

Results and discussion. SSHA factor scores and aptitude 
variables were correlated with achievement scores, and in addi- 
tion, stepwise multiple correlations were obtained between SSHA 
scores and each achievement criterion. 

Table 1 presents the [...] 
The factors are similar [...] 

—————— 
1 The authors thank The Psychological Corporation for permission to use 
a modified version of [...] 

TABLE 1 
SSHA Factors 

Factor  Name                    Item 
I       Attitude towards        I think teachers usually talk too much. 
        Teachers 
II      Need Achievement        When thinking of going to college, my 
                                main reason is that it will help me 
                                be "somebody." 
III     Attitude towards        I believe that the main job of the 
        Education               schools is to teach students things 
                                that will help them. 
IV      Achievement Anxiety     I dislike competition in any form. 
V       (Indefinable) 
VI      Work Methods            I prefer to study my lessons alone 
                                rather than with others. 
VII     Academic Motivation     I put off doing my written work until 
        (Perseverance)          the last minute. 


achievement, and work methods and academic achievement scores 
are significantly different from zero. Correlations between the 
remaining factors and achievement criteria are negligible. Results 
of the stepwise regression analysis are presented in Table 3. A 
comparison of the two sets of multiple correlations between 
SSHA factors and achievement criteria indicates that there is very 


TABLE 2 
Intercorrelations of Predictors and Criteria (N = 240) 


Grade 8 Marks 


Note—Decimal points omitted. 

TABLE 3 

Multiple Correlations between SSHA Factors and Criteria, and Zero Order 
Correlations between CAAT and Criteria 

                                    Criteria 
                                                        Grade 8 
Predictor             DGAT—V   DGAT—S   DGAT—AC         Marks 
SSHA—I + II + VI a     .48 b     .43      .39            .54 
SSHA—(ALL)             .54       .46      .44         (illegible) 
CAAT—V                 .76       .56      .59         (illegible) 
CAAT—M                 .54       .41      .60            .47 

a Factors I, II and VI were the first three significant variables in the stepwise regression. 
b All correlations significant at .01 level. 


little increase when the remaining four factors are added to 
factors I, II, and VI. The multiple correlations between the three 
significant factors and the Grade 8 achievement criteria range 
from .39 to .54. The correlations between verbal aptitude and 
achievement range from .56 to .76, and between mathematical 
aptitude and achievement the range is from .41 to .60. 
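
The proportion of criterion variance associated with each predictor follows from squaring the reported coefficients. The sketch below applies this to the ranges quoted in the text; the dictionary layout and the name `variance_explained` are assumptions for illustration.

```python
def variance_explained(r):
    """Proportion of criterion variance shared with the predictor (r squared)."""
    return r * r

# Lowest and highest coefficients reported in the text for each predictor set.
reported_ranges = {
    "SSHA factors I, II, VI (multiple R)": (0.39, 0.54),
    "CAAT verbal": (0.56, 0.76),
    "CAAT mathematical": (0.41, 0.60),
}

for label, (low, high) in reported_ranges.items():
    print(label,
          round(variance_explained(low), 2),
          round(variance_explained(high), 2))

# Share of variance left unexplained by the single strongest coefficient.
unexplained_at_best = 1.0 - variance_explained(0.76)
```

Even the largest coefficient, .76, leaves more than 40 per cent of the criterion variance unaccounted for.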

Although aptitude variables are significant predictors of scho- 
lastic achievement, a reasonable amount of criterion variance re- 
mains unexplained. As is typically the case, a search for new 
correlates of academic performance will continue in order to 
identify factors which may improve the prediction of achieve- 
ment. Since nonintellective factors are susceptible to modification 
through counselling procedures, positive changes in relevant at- 
titudes and motivation may help to provide a more conducive 
psychological environment which in turn may improve an in- 
dividual’s scholastic performance. Some SSHA factors do seem 
to measure significant nonintellective variables and thus deserve 
more systematic investigation in natural and experimental settings. 

THE RELATIONSHIP BETWEEN MANIFEST ANXIETY 
AND MEASURES OF APTITUDE, ACHIEVEMENT, 
AND INTEREST 


FLOYD S. IRVIN¹ 


Psychosomatic and Psychiatric Institute 
Michael Reese Hospital and Medical Center 


Since its inception as a clinical tool, the Taylor Manifest Anxiety 
Scale (MAS) has been used to study the relationship between anx- 
iety and a number of important variables. Considerable research 
attention has been given to exploring correlations between anxiety 
and learning (Farber and Spence, 1952; Mandler and Sarason, 
1952) and between anxiety and intellectual performance (Mayzner, 
Sersen, and Tresselt, 1955; Spielberger and Katzenmeyer, 1959; 
Zdep, 1966). More recently, research efforts have been directed at 
compiling personality profiles of persons with differing anxiety 
levels (Golin, Herron, Lakota and Reineck, 1967; Weitzner, Stal- 
lone, and Smith, 1967). The stated theoretical position or implicit 
assumption held by MAS researchers is that anxiety is generally a 
debilitating and disruptive phenomenon (Taylor, 1953; Boor and 
Shill, 1967). This position does not make any attempt to qualify 
or identify those areas of human behavior which are subject to 
disruption. On the contrary, the theory suggests that anxiety is 
unilaterally debilitating and interfering. There is reason to wonder, 
however, whether anxiety is as pervasively disruptive as the 
theory suggests. That some areas of human performance may re- 
main undisturbed or perhaps be even positively enhanced by ele- 
vated anxiety is not inconceivable. 

The present study was an effort to investigate the currently held 


1 The author wishes to acknowledge his indebtedness to Dr. M. Henry Pitts 
for his assistance in the planning of this study. 

view that manifest anxiety is unilaterally disruptive. More specif- 
ically, the study was directed at examining the relationship among 
variables of scholastic aptitude, academic achievement, and voca- 
tional interest for persons experiencing different levels of anxiety. 

Methodology. The subjects (Ss) were 103 male students enrolled 
in an architecture program at the University of Illinois at Chicago 
during the fall quarter of 1966; 23 of the Ss had to be discarded 
from the sample because of incomplete test data, leaving a total of 
80. The Ss ranged in age from 17 to 20. For the most part Ss were 
classified as freshmen, this group totalling 91 per cent; the remain- 
ing nine per cent were sophomores. 

Each S completed a battery of tests prior to enrolling as a stu- 
dent at the university. These tests included the American College 
Testing (ACT) battery and the Strong Vocational Interest Blank 
(SVIB). Since all Ss were architecture students, the architect scale 
of the SVIB was used as a single index of interest. Indices of aca- 
demic performance were also collected for all Ss. Each S's grade 
point average was the mean of all grades earned during the fall quar- 
ter of 1966, the period during which the MAS was administered. The 
MAS was given to all first year and transferring architecture stu- 
dents who were participating in an experimental orientation pro- 
gram. 

On the basis of the MAS scores, the 80 Ss were arranged in 
[...] order and divided into three anxiety groups. The upper 
25 per cent of the distribution [...] 
anxiety (HA, N = 20), the middle 50 per cent (Group II) was 
[...] = 40) and the lower 25 per 
[...] anxiety (LA, N = 20). The 
[...] IA: 9-18; LA: 0-8. 
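
The grouping procedure (upper 25 per cent, middle 50 per cent, lower 25 per cent of the ordered scores) may be sketched as follows. The scores are hypothetical, not the study's MAS data; the function name `split_by_rank` is an assumption, and this simple version does not treat tied scores at the cutting points specially.

```python
def split_by_rank(scores):
    """Return (high, middle, low): top 25%, middle 50%, bottom 25% of scores."""
    ranked = sorted(scores, reverse=True)  # highest scores first
    quarter = len(ranked) // 4
    high = ranked[:quarter]
    middle = ranked[quarter:len(ranked) - quarter]
    low = ranked[len(ranked) - quarter:]
    return high, middle, low

# Hypothetical anxiety scores for 80 subjects (illustrative only).
hypothetical_scores = [(7 * i) % 31 for i in range(80)]

high, middle, low = split_by_rank(hypothetical_scores)
print(len(high), len(middle), len(low))
```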


d deviations in some compari- 


I-II & I-III) for scores re- 


IA Ss were moderately superior to both HA and LA Ss; these 
differences were not confirmed beyond chance occurrence. Table 1 
rather clearly demonstrates, however, that HA Ss were generally 
less effective with respect to aptitude and grade performance than 
either IA or LA Ss. 

A t ratio was also computed between HA, IA and LA groups for 
scores received on the architect scale of the SVIB. Elevated scores 
on this scale were partially in accord with predictions. The finding 
that Ss characterized by LA received elevated scores on the archi- 
tect scale was confirmed at the .05 level. Contrary to prediction, 
however, was the finding, confirmed at the .01 level, that Ss char- 
acterized by HA also obtained elevated scores on the interest index. 
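
A t ratio of this kind can be computed from group means, standard deviations, and sizes alone. The sketch below uses the aptitude (ACT) figures for the HA and LA groups as they can be read from Table 1 (means 21.75 and 23.55, standard deviations 2.40 and 2.48, N = 20 each). It assumes a pooled-variance t with sample standard deviations; the article does not state which formula was used, so the result need not match the printed ratio exactly. The function name `pooled_t` is an assumption.

```python
from math import sqrt

def pooled_t(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Independent-groups t ratio with a pooled variance estimate."""
    pooled_variance = (((n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2)
                       / (n_1 + n_2 - 2))
    standard_error = sqrt(pooled_variance * (1.0 / n_1 + 1.0 / n_2))
    return (mean_1 - mean_2) / standard_error

# ACT battery: low anxious (LA) versus high anxious (HA), values from Table 1.
t_la_vs_ha = pooled_t(23.55, 2.48, 20, 21.75, 2.40, 20)
degrees_of_freedom = 20 + 20 - 2

print(round(t_la_vs_ha, 2), degrees_of_freedom)
```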
Discussion. The findings presented in Table 1 suggest that non- 
anxious persons perform better than anxious persons with respect 
to aptitude tests and academic achievement. These results are par- 
tially in accord with current theory which suggests that anxiety is a 
debilitating force. The data diverge from a priori expectations, 
however, in terms of vocational interest expressed by the highly 
anxious group of students. It seemed reasonable to assume that the 
anxious student, characterized by tension and confusion, would have 
difficulty in sorting out his likes and dislikes and in reporting them 
on an interest inventory. However, the findings of this study pro- 
vide evidence to the contrary, since anxious students as a group 
received a mean score on the interest index which was comparable 
to that calculated for the nonanxious students (see Table 1). 


TABLE 1 

Means, Standard Deviations and t Ratios for Aptitude, Achievement, and Interest Measures of 
High Anxious (HA), Intermediate Anxious (IA), and Low Anxious (LA) Subjects 

                          Group I (HA)           Group II (IA)          Group III (LA) 
Measure                   Mean    SD     t       Mean    SD     t       Mean    SD     t 
American College 
  Testing Battery         21.75   2.40   .49     22.15   3.12   1.71    23.55   2.48   2.28* 
Grade Point Average       (values illegible) 
Strong Vocational 
  Interest Blank c        40.90   8.31   3.59**  32.65   7.34   2.81*   38.80   8.82   .50 

c SVIB scores are for architect scale only. 

One possible way of understanding the elevated interest index for 
the anxious individual is to consider carefully his situation. As an 
anxious person who enters college and enrolls in a professional pro- 
gram, viz., architecture, he may be desperate to impose structure 
on and obtain meaning for his ambiguous and confusing initial ex- 
periences. Indeed, such a person might very well be inclined to re- 
flect a highly developed interest in his chosen career as a means of 
controlling at least one aspect of his environment. Elevated anx- 
iety, then, instead of reducing the ability to make sharp differentia- 
tions in interest, appears to move a significant proportion of the 
architecture students to make even clearer vocational distinctions. 

Generalizations about the relationship between interest and anx- 
iety level may be limited by questions concerning the usefulness 
of the MAS as a single criterion for assessing complex interactions. 
It must be kept in mind that students who have performed well 
on aptitude tests, been accepted into and made public commitment 
to a professional educational program, are not likely to permit their 
anxieties to emerge on a self-report inventory. Since endorsing items 
on the MAS means admitting to problems and self-doubts, the stu- 
dent who aspires toward success may suppress actively his anxieties 
in such a test situation. 

Several conclusions which emerge from this discussion demand 
[...] on level of anxiety [...] 
warranted. Second, demographic [...] 
[...] motivational factors unique 
[...] into account when the MAS 
is administered if self-report distortion is to be minimized. Third, 
anxiety seems to be more selective and specific than current theory 
[...] 

[...] academic performance, and 
[...] current theory, it was predicted 
[...] 
concerning aptitude and achievement but not interest. Implications 
for these findings were discussed. It was suggested that anxiety 
may not be so unilaterally disruptive and debilitating as current 
theory indicates. It was further suggested that anxiety may be a 
motivating force for vocational decision-making. 

AN EXPERIMENTAL VALIDATION STUDY OF 
THE PURDUE RATING SCALE FOR INSTRUCTION¹ 


DONALD R. MIKLICH² 
University of Hawaii 


Recently, Bresler (1968) presented evidence that professors who 
publish and who receive extramural grants are considered by their 
students to be better teachers. Whether Bresler’s data bear on any- 
thing more than student morale depends upon the validity of stu- 
dent ratings, a validity which is often questioned. Bresler cited no 
evidence that student ratings are valid, and an extensive literature 
search by this writer has failed to discover any experimental test of 
the validity of students’ ratings. There do exist two correlational 
validity studies for the Purdue Rating Scale for Instruction, PRSI, 
(Remmers, 1960). In the first, Remmers, Martin, and Elliott (1949) 
defined “good” and “poor” instructors as those whose students 
received higher or lower grades than would have been predicted 
from the students’ placement tests. The “good” teachers were given 
superior ratings. Since it has been shown that instructors who grade 
leniently receive high student ratings (Anikeef, 1953), and that 
students who perform better in a class give high ratings to their 
instructors (e.g., Stewart and Malpass, 1966), this procedure may 
not have accurately identified “good” teachers. In the second study, 
Elliott (reported in Remmers, 1960) found that instructors with 
five or more years’ experience were given higher student ratings than 
those with less experience. This is more convincing, but still only 


1 The materials for this study were provided by the Associated Students of 
the University of Hawaii. Most of the analysis was performed with the 
computer facilities of the UH Social Sciences Research Institute. 
2 Now at Children's Asthma Research Institute and Hospital, Denver, 
Colorado. 

correlational. It was not shown that the students were responding 
to the more experienced instructors’ purported superior teaching, 
rather than to some other factor. Since it is generally agreed that 
younger instructors grade rigorously, Elliott’s finding may simply 
be another demonstration of the effect of grades on student ratings. 

This paper reports a “natural experiment” which allowed a test 
of students’ ability to discriminate validly better prepared, more 
experienced, and more interested teaching while controlling other 
variables associated with the instructor. 


The second part of the PRSI contains 16 5-point, single item 
[...]. With two exceptions, [...] 
[...] two psychologists predicted the 
[...] which would be expected between the 
[...] were valid. It was expected that since 
[...] more enjoyable to its students, halo 
[...] appear in the form of superior ratings by members of 
[...] on scales where no differences were predicted. 

Results. All distributions were skewed and for many scales the 
variances were significantly di[...] 
[...] Comparisons were, therefore, 
[...] nonparametric U, test (Siegel, 
[...] 
[...] differences even when the validity hypothesis did not, conservative 
[...] 
hypotheses (H₀) in some cases and acceptance in others, so two p 
values were used. When p < .05, H₀ was rejected. When p > .20, 
H₀ was accepted. 
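
The two-criterion decision rule can be written out as a short sketch. The p values are those given in the text and in Table 1 for the scales named; the function name `decide` and its return labels are assumptions introduced here for illustration.

```python
def decide(p, reject_below=0.05, accept_above=0.20):
    """Classify a p value under the two-criterion rule described above."""
    if p < reject_below:
        return "reject H0"
    if p > accept_above:
        return "accept H0"
    return "indeterminate"

# p values as reported for selected scales (scale numbers in parentheses).
reported_p = {
    "Interest in Subject (1)": 0.13,
    "Fairness in Grading (3)": 0.03,
    "Liberal and Progressive Attitude (4)": 0.59,
    "Presentation of Subject Matter (5)": 0.17,
    "Self-Reliance and Confidence (7)": 0.004,
}

decisions = {scale: decide(p) for scale, p in reported_p.items()}
for scale, outcome in decisions.items():
    print(scale, outcome)
```

The middle band between .05 and .20 is what allows a result to be reported as neither confirming nor disconfirming a prediction.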

The results are presented in Table 1. All four scales where dif- 
ferences were predicted show differences in the predicted direction. 
Presentation of Subject Matter (5) and Interest in Subject (1), 
however, are indeterminate since their p values (.17 and .13, 
respectively) are in that area where H₀ can neither be accepted 
nor rejected. The differences with Self-Reliance and Confidence (7) 
and Fairness in Grading (3), are significant. The results with the 
latter scale are particularly interesting since the direction of the 
difference is opposite to that expected from halo effect. The intro- 
ductory class had been given an experimental mid-term on which 


TABLE 1 

Means, Predicted Differences, and the Mann-Whitney U Tests Comparing 
Student Ratings on Part 1 of the Purdue Rating 
Scale for Instruction 

                                       Class Means a 
Scale                              Introductory  Statistics  Prediction b    U,      p 
 1. Interest in Subject                2.12         1.82          +          1.44    .13 
 2. Sympathetic Attitude 
    toward Student                     2.22         1.82          0          1.51    .07 
 3. Fairness in Grading                1.51         1.92          −         −1.92    .03 
 4. Liberal and Progressive 
    Attitude                           1.93         2.03          0          −.49    .59 
 5. Presentation of Subject 
    Matter                             3.49         3.19          +           .94    .17 
 6. Sense of Proportion 
    and Humor                          1.88         1.67          0       (illegible) 
 7. Self-Reliance and 
    Confidence                         2.41         1.75          +          2.69    .004 
 8. Personal Peculiarities             2.83         2.34          0          1.47    .07 
 9. Personal Appearance                2.56         2.36          0           .07    .47 
10. Stimulating Intellectual 
    Curiosity                          3.41         3.55          0          −.51    .70 

Class sizes: N = 65; N = 41 to 67. 
a Scores ranged from 1 (best) to 10 (worst). This scoring is different from that normally used 
with the PRSI, and was used to facilitate the computer based analysis. 
b Positive difference = +; No difference = 0; and Negative difference = −. 

an item analysis was performed. Items that were not demonstrably 
reliable and valid were discarded before computing grades. This 
was carefully explained to the class so they had reason to rate their 
grading as being particularly fair. 

‘No difference’ predictions were confirmed with four scales. These 
were Liberal and Progressive Attitude (4), Sense of Proportion and 
Humor (6), Personal Appearance (9), and Stimulating Intellectual 
Curiosity (10). For at least two scales, these results reflect not simply 
passive agreement due to the lack of a discriminant, but discrim- 
inating agreements. Both classes rated the instructor high on scale six 
and low on scale nine. The instructor, his colleagues, and indeed his 
wife agreed that this was objective and valid. The two remaining 
scales for which no differences were predicted, Sympathetic Attitude 
toward Student (2) and Personal Peculiarities (8), both showed 
tendencies which while they do not reach customary significance 
levels, suggest the presence of some halo effect. 

The results with two scales from the second part of the PRSI are 
of interest. In the Statistics class, administrative difficulties forced 
the use of an inappropriate text. The Suitability of Assigned Text- 
book (17) scale reflects this with a 1.52 point difference on a 5-point 
scale (U, = 6.43, p < 10⁻⁷). No difference was found on Overall 
Rating of the Instructor (26), notwithstanding the expected halo 
effect (U, = .26, p = .40). This suggests a definite lack of validity 
in this particular scale. The PRSI norms (Remmers, 1960) show 
that students do not use the negative end of this scale. With a five- 
point scale, this so reduced the variance of the ratings that only the 
grossest differences can be detected. 

Discussion. The number of raters in this study is ample to provide 
for safe generalization to other raters, but there was only one ratee. 
Were these results singular or different from other studies, only 
very tentative conclusions could be drawn. However, since these 
results agree with those of Remmers [...] 
[...]mers, 1960), we may conclude with s[...] 
[...]ent to rate such factors as, for example, 
the instructor’s knowledge of his subject matter, which the student 
cannot reasonably be expected to be able to judge. The author does 
not mean to suggest that these results imply that students could 
validly rate such factors. Indeed, he doubts that they could. 

A STUDY OF EVALUATION IN SOCIAL SCIENCE 
ORIENTED EDUCATIONAL PSYCHOLOGY COURSES¹ 


JANET W. JOHNSON AND ROBERT S. WALDROP 
University of Maryland 


The Redwood School Test (Hurst, Ronning, and Bartlett, 1961) 
is designed to measure the ability to apply principles of educational 
psychology. The Test was constructed to measure a factor, desig- 
nated as application ability, identified through a factor analysis 
of evaluation variables in an elementary psychology course. In an 
investigation based on evidence that class tests typically do not 
measure all the desirable outcomes of educational psychology courses 
Bartlett, Ronning, and Hurst (1960) identified the factors of factual 
knowledge, application, and general achievement ability. 

The Redwood School Test (RST) is a case study followed by 117 
statements with which the examinee agrees or disagrees. The state- 
ments are classified by the test authors into diagnostic and remedial 
items. The major evidence of validity of RST as a measure of appli- 
cation ability has come from factor analysis (Hurst, Ronning, and 
Bartlett, 1961) and from a construct validity study (Braun, 1965). 
RST had substantial loadings on the application ability factor 
identified in the Bartlett, Ronning, and Hurst study (1960). Hurst, 
Ronning, and Bartlett (1961) reported correlations ranging from 
.21 to .35 with measures of leadership, general achievement, and 
course achievement. They likewise found a higher correlation be- 
tween RST and the Minnesota Teacher Attitude Inventory (MTAI) 
given at the end of the course than with the MTAI given at the 
beginning, which they interpreted as indicating that attitude change 
takes place with an increase in the ability to apply knowledge of 


1 The authors gratefully acknowledge the support of the Computer Science 
Center of the University of Maryland for data analysis. 

facts and principles. Braun (1965), in the validity study cited above, 
compared improvement in RST scores and reported significantly 
more improvement between precourse and postcourse scores for 
students in an educational psychology course than in a general 
psychology course. However, in a second educational psychology 
course which was combined with student teaching, a nonsignificant 
loss was found between precourse and postcourse scores. Braun 
interpreted his results as supporting the construct validity of the 
test, attributing the failure of the combined teaching-educational 
psychology group to show significant improvement to shortcom- 
ings in the nature of the course. In all of these studies, the educational 
psychology Ss were from courses in which the vast majority of the 
students were in various teacher training programs. 

Courses in educational psychology are also offered for the student 
who is interested in the study of learning in educational settings, 
but who is not primarily concerned with teacher preparation. The 
present study was undertaken to investigate the applicability of 
RST as a means of evaluation i[...] upon the psychologist’s [...] 
educative process. The a[...] performance, such as gen[...] 
[...]edge, amount of previous psychological course work, and 
educationally related attitudes. 

[...] Ss were 193 students in a junior-senior [...]ology course offered 
thr[...] of Psychology over a period of two regular [...] summer 
session. These students were enrolled in five sections taught by [...]. 
There were two regular semester sections, two summer session sections, 
and one campus night section. All students were required to have 
previously completed at least one psychology course. Of the 193 Ss who 
took RST, nine failed to complete the test in usable form. The final 
sample of 184 Ss was divided into two groups: Group I (postcourse test 
only), 31 women and 24 
men who were tested with RST at the completion of the educational 
psychology course, and Group II (precourse and postcourse tests), 
78 women and 51 men who were tested with RST at the beginning 
and at the end of the course. In Group II, 67 women and 39 men 
completed both the pre- and post-RST administrations. 

Table 1 indicates the major field and career plans of 162 Ss. At the 
beginning of the course each student was asked to complete a card 
giving information about the number and content of previous psy- 
chology courses, his reasons for taking the current course, career 
plans, and major field of study. Eighty-seven responses were classi- 
fiable into career plan categories. Teaching was involved in 37 per 
cent of the stated career plans; in addition 21 per cent of the Ss 
listed careers related to psychology. Major field of study for 49 per 
cent of the Ss was either psychology or sociology while less than 10 
per cent were enrolled in education programs. 


TABLE 1 
Major Fields of Study and Career Plans of 162 
Educational Psychology Students 

Major field of study                      N       % 
Liberal Arts 
  Psychology                             47    29.0 
  Sociology                              32    19.8 
  Other social sciences                  [illegible] 
  Arts and humanities                    16     9.9 
  Speech and speech therapy               8     4.9 
  Premedical or predental                 4     2.5 
  Natural science                         4     2.5 
  Mathematics and physical science        6     3.7 
  Liberal arts-undecided                  1     0.6 
Education                                16     9.9 
Nursing                                  13     8.0 
Home Economics                           [illegible] 
Business                                  4     2.5 
Agriculture                              [illegible] 
[illegible]                               1     0.6 

Career plans (N = 87)                     N       % 
Teaching                                 32    36.8 
Health professions (M.D., D.D.S., R.N.)  19    21.8 
Psychological fields                     18    20.7 
Social Work                               8     9.2 
Other                                    [illegible] 
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Measures. In addition to the Redwood School Test, described 
above, the following additional measures were used: 

FC-TAI—Forced Choice-Teacher Attitude Inventory (Bartlett, 
1966). This is a forced choice form of the MTAI designed to measure 
school-related attitudes. The S indicates for each group of four 
statements, the two with which he agrees most. Each S makes a 
total of 42 choices. 

TOBKIP—Test of Basic Knowledge in Psychology. This is an ex- 
perimental 25-item multiple-choice screening test of basic psycho- 
logical knowledge developed by the authors as a screening test for 
courses requiring a course in general psychology as background. 
When administered to students completing a general psychology 
course a significant relationship with final course grade was ob- 
tained. In addition, precourse TOBKIP scores were related to final 
course grade in several sophomore and junior level psychology 
courses. 

BIC—Biographical Information Card. This is a brief question- 
naire of student background information as described above. 

Procedure. Group I Ss, at the final class meetings of the courses, 
were requested to take RST and the FC-TAI as part of a research 
project, and were told that the results would not influence their 
grade in the course. In addition, 33 of the 55 Ss had previously 
completed the biographical information cards. Group II Ss at the 
initial course class meetings completed the biographical information 
cards, RST, FC-TAI and TOBKIP. At the final meeting of the 
course, these Ss again took RST and FC-TAI. Pretest scores were 
[...] 

[...] obtained at the end of the course, were correlated with course 
grade for Group I (r = .369, df = 53, p < [...]). [...] scores 
correlated .244 (df = 108, p < .01) with course grade for Group II. 
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Table 2 indicates the mean and standard deviation of each vari- 
able for each group. For Group I, means and standard deviations of 
grade, number of previously completed psychology courses, and 
post-course RST and FC-TAI scores are given in the first column. 
In the second column comparable statistics are given for the same 
variables, for precourse RST and FC-TAI, and for TOBKIP. The 
third column provides comparable data for those in Group II who 
were precourse tested only. 

There was no significant difference in RST end-of-course per- 
formance between Group I and Group II. Within Group II there 
was a mean gain of 1.15 points from the precourse test to the post- 
course test. This gain was significant at the .05 level (t = 2.31, df 
= 104, p < .05). Pre- and postcourse scores correlated .764 (df = 
104, p < .01). Mean performance on precourse RST within Group 
II was found to differ between those who completed both testings 
and those who took only the first test (t = 2.92, df = 123, p < .01). 
Only four Ss who failed to take the precourse RST took the post- 
course test. 
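
The significance statement attached to the pre-post correlation can be checked with the usual conversion of a product-moment r to a t statistic, t = r * sqrt(df) / sqrt(1 - r^2), with df = N - 2. The following is a minimal sketch only: the function name is an illustrative assumption, and the comparison value of about 2.6 is the approximate two-tailed .01 critical t for roughly 100 degrees of freedom, not a figure taken from the report.

```python
import math

def t_from_r(r, df):
    """t statistic for testing a Pearson r against zero, with df = N - 2."""
    return r * math.sqrt(df) / math.sqrt(1.0 - r * r)

# Pre-post RST correlation reported in the text: r = .764, df = 104.
t_pre_post = t_from_r(0.764, 104)

# Approximate two-tailed .01 critical value of t near 100 df (an assumption
# read from a standard t table, not from the report).
APPROX_CRITICAL_T_01 = 2.6

print(round(t_pre_post, 2), t_pre_post > APPROX_CRITICAL_T_01)
```

The resulting t is far above the comparison value, which is consistent with the reported p < .01.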

Table 3 furnishes the correlations between the RST scores and 
sex, number of psychology courses, TOBKIP scores, course grades, 
and FC-TAI scores. In addition, for comparative purposes, cor- 
relations between RST and MTAI scores from the Hurst, Ronning, 
and Bartlett (1961) study are given. 


Discussion 
Correlations between postcourse RST scores and course grades 


TABLE 2 
Means and Standard Deviations 

                                  Group I             Group II         Group II (Pre only)* 
Variable                        N     M     SD      N     M     SD      N     M     SD 
Course grade (on 
  [illegible] scale)           55   2.29  1.08    106   2.29   .97     15   1.73  1.22 
Number of psychology courses   [illegible]        104   3.59  2.13     19   4.05  2.30 
Pre-RST                        [illegible]        [illegible] 78.14 7.15   19  72.68  8.85 
Pre-FC-TAI                     [illegible]        104  23.78  4.27     19  23.32  4.06 
TOBKIP                          —     —     —     108  14.85  3.30     10  14.00  4.85 
Post-RST                       [illegible]        106  79.29  7.79      —     —     — 
Post-FC-TAI                    44  23.09  4.02     96  23.65  3.77      2  25.00  7.07 

* Failed to take the posttest. 


974 EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 


TABLE 3 

Correlations between Postcourse Scores for Group I and between Pre- and Post-Course 
Scores for Group II on RST with other Variables along with Comparable 
Correlations between RST and MTAI (Hurst, Ronning, and Bartlett, 1961) 

                          Group I           Group II          Group II         Hurst, et al. 
                          (Post)             (Pre)             (Post)           (N = 100) 
Variable                  r       N         r       N         r       N         r 
Sex                     —.07  [illegible] [illegible] 125     .10     110       — 
FC-TAI (pre)              —       —        .25**   123       .35**   108       .27 (MTAI precourse) 
FC-TAI (post)            .22     44        .42*     98       .38**   100       .35 (MTAI postcourse) 
TOBKIP (pre)              —       —        .39**   121       .39**   107       — 
Grade in Course        [illegible]       [illegible]         .24**  [illegible] — 
Previous Number 
  of Psychology 
  Courses               —.18     31        .06     123       .09     108       — 

* p < .05. 
** p < .01. 


for Group I and Group II were significant at the .05 and .01 levels, 
respectively. The use of course grade as a criterion involves the 
problems of complexity, unknown reliability, and validity. This 
choice of criterion was based on the consideration that course grade 
is the practical measure of achievement in multiple sectioned courses 
which do not have common class examinations. The correlations 
found between RST scores and course grades, in this study, are of 
the same magnitude as Hurst, Ronning and Bartlett (1961) 
[...] and a multiple choice examination based on 
[...] measure of ability of a potential teacher 
[...] in a schoolroom situation. In educational 
[...] conducted primarily for students with social 
[...] focus is less likely to be on this type of application 
[...] situation. To determine whether the RST would 
[...] to other evaluations in a course focused on 
[...]ological variables in an educational setting, a 
[...] and postcourse performance on the RST 
[...] statistically significant (p < .05), the actual 
[...] be of limited practical [...] 


These results would seem to be consistent with those of Braun 
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(1965). He found a gain on the RST of 5.83 points (p < .001) for a 
group of students in a teacher training program enrolled in an edu- 
cational psychology course, as opposed to a gain of 2.14 points 
(p < .01) for general psychology students. The high initial level of 
RST performance of the Ss in the present study, although not sig- 
nificantly related to the specific number of previous psychology 
courses, may reflect a more extensive psychological background and 
account for failure to find as great an absolute amount of improve- 
ment after one additional course as Braun reported. Additionally, 
almost two-thirds of the students in this study did not anticipate 
elementary or secondary school teaching careers. The teacher-ori- 
ented case study approach of the RST may seem less relevant, and 
hence the effects of differences in test-taking motivation may be 
more heterogeneously distributed in this type of group. 

In comparing the relationship of the RST with other variables, 
Hurst, Ronning, and Bartlett (1961) noted that postcourse RST 
scores correlated .27 with precourse MTAI and .35 with post- 
course MTAI. Table 3 indicates comparable results of the RST with 
the FC-TAI as obtained in this study. It will also be noted that 
precourse RST scores correlated .25 with precourse FC-TAI and 
that the postcourse scores correlated .38. This observation would 
appear to support the trend for attitude and applicational ability 
scores to be less related prior to course experience than after such 
experience. 

Both precourse and postcourse RST performances were signif- 
icantly related to previously acquired psychological knowledge as 
measured by scores on TOBKIP with no change in the degree of 
relationship after course experience. 


Thus evidence for the construct validity of RST (from a sta- 
tistical point of view) is to be found in the fact that RST scores 
correlated significantly with final course grade and that a significant 
improvement was found between precourse and postcourse per- 
formance. The same pattern of relationship was also found between 
RST and an attitude measure reported by Hurst, Ronning and Bart- 
lett (1961). Performance on the RST was also significantly related 
to previously acquired psychological knowledge. However the small 
difference in RST scores, between precourse and postcourse tests, 
as well as the low correlations with course grade would indicate 
caution in using the RST as an evaluative instrument in educa- 
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tional psychology courses in social science curricula. Rather the 
RST would seem more appropriate for measuring proficiency in 
teacher preparation for which it was designed. Until further re- 
search has been done comparing RST performance in educational 
psychology courses with differing goals, the authors question its 
validity for evaluation in social science oriented educational psy- 
chology courses. 
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OVERALL MEASURES OF SELF-ACTUALIZATION 
DERIVED FROM THE PERSONAL ORIENTATION 
INVENTORY 


VERNON J. DAMM 
University of Portland 


THE Personal Orientation Inventory (POI) was designed by 
Everett L. Shostrom (1964, 1966) as a group test for measuring 
self-actualization. Its validity is based primarily upon clinically 
judged self-actualized vs. non-self-actualized subjects (Shostrom, 
1966). Other studies report the test’s ability to differentiate be- 
tween such things as pre- and post-sensitivity training (Shostrom, 
1964), stages of psychotherapy involvement (Shostrom and Knapp, 
1966), levels of performance on a neuroticism inventory (Knapp, 
1965), and achieving vs. underachieving college freshmen (LeMay 
and Damm, 1968). 

The test consists of 150 two-choice (paired opposites) compara- 
tive value judgments which can be broken down into the following 
scales: Inner Directed (I), Time Competent (Tc), Self-Actualizing 
Value (SAV), Existentiality (Ex), Feeling Reactivity (Fr), Spon- 
taneity (S), Self Regard (Sr), Self Acceptance (Sa), Nature of Man 
(Nc), Synergy (Sy), Acceptance of Aggression (A), and Capac- 
ity for Intimate Contact (C). 

Although differential profile patterns have been demonstrated to 
allow for predictive judgments, it has not yet been demonstrated 
which of a number of alternative methods may provide the best 
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overall measure of self-actualization. When trying to deal with this 
problem, certain factors must be taken into consideration: 

First, there is considerable range in the number of items per 
scale, varying from nine to 127. The I scale includes 127 items and 
the Tc scale includes the remaining 23 of the total 150 items in the 
inventory. The remaining scales include from nine to 31 items each. 

Second, there is a differential amount of overlapping of inter- 
scale items, varying from zero to 28. Each scale, excluding the Tc 
scale, overlaps with the I scale with the exception of from zero to 
three items per scale. Therefore, since the I scale overlaps most 
heavily with all other scales, it would appear that this scale should 
most likely represent an overall measure of the POI. Nevertheless, 
it must still be taken into consideration that the Tc scale contains 
23 items, none of which overlap with the I scale. 

Third, there are considerably different profile patterns which 
characterize different population groups. 

It is obvious from this that the structure of the inventory does 
not dictate that which might most effectively provide for an over- 
all measure of self-actualization. Furthermore, there are no re- 
ported established weighting procedures for combining the individual 
scales outside of Shostrom’s (1966) claim that “when a quick esti- 
mate is desired of the examinee’s level of self-actualization, the Time 
Competence (Tc) and Inner Directed (I) scales only may be 
scored [p. 7].” The manual does not explain, however, what the 
relationship of the I to the Tc scale should be. It does not state 
whether these scales should be regarded separately, weighted 
equally, or be combined in some other way. Knapp (1965) nar- 
rowed this down further by maintaining that “for present pur- 
poses, the I scale (inner directed) scores were used as the best single 
estimate of self-actualization [p. 171].” 

[...] is that the combined I and Tc [...] desired choice for each item in 
[...] occur: for item 16, the Fr [...] scores choice “A,” whereas the I 
and Sr scales score choice [...]; [...] 
score choice “A,” whereas the A scale scores choice “B”; for item 
115, the I and Nc scales score choice “A,” whereas the A scale 
scores choice “B.” With this discrepancy in scoring it is not pos- 
sible to say that there is a given scored choice for each item re- 
gardless of scale; also it is not possible to sum the I and Tc scale 
scores and thereby be able to score all 150 items in a positive 
direction. 

Before an overall measure can be accepted without reservation 
such a scale should be validated against some external criterion. 
Until such validation procedure is undertaken, a test administrator 
must choose one of a number of available alternatives for obtaining 
an overall POI score. This report is based upon a study which 
explored some of these possibilities. 

Method. Ninety-five male and 113 female students from Willa- 
mette High School, Eugene, Oregon, were used as Ss. They were 
administered the POI during a 50-minute study-hall period by the 
study-hall teacher. When a S scored either both or neither item 
alternative for more than three items, he was eliminated from the 
sample. From these data four methods for deriving an overall 
measure of self-actualization were examined. 

For each of the 12 original scales a raw score distribution was 
derived from the scores of the 208 Ss, which was then converted into 
a standard score distribution. The 12 standard score scale values 
for each S were then averaged, which made up a Standard Score: 
Average Overall Scale (S:AOS). Part of the rationale for develop- 
ing this scale originated out of the report by Shostrom (1966) that 
“self-actualized groups are significantly higher on all scales and non- 
self-actualized groups tend to be lower on all scales [p. 21]." 

A Standard Score: Inner Directed-Time Competent Scale (S:I- 
Tc) was developed by combining the standard scores of the I and 
Tc scales for each S. It was noted earlier that these two scales cover 
all 150 items of the POI, and since there was a vast discrepancy in 
the number of items per scale, namely, 127 vs. 23, respectively, it 
was regarded as important to consider the use of standard scores in 
opposition to raw scores for combining these scales. 

However, to determine whether this procedure was in fact nec- 
essary, a Raw Score: Inner Directed-Time Competent Scale (R:I- 
Tc) was developed by combining the raw scores of the I and Tc 
scales. 

The Raw Score: Inner Directed Scale (R:I) merely consisted of 
the original I scale of the POI. 
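
The four candidate overall scores described above can be sketched in a few lines. This is an illustration under stated assumptions, not the author's procedure: the scores are hypothetical, only three scales are used where the study averaged all 12, "combining" is taken to mean summing, and the helper names are invented for the example.

```python
from statistics import mean, pstdev

def z_scores(raw):
    """Standardize raw scores within the sample (mean 0, SD 1)."""
    m, sd = mean(raw), pstdev(raw)
    return [(x - m) / sd for x in raw]

def pearson(x, y):
    """Product-moment correlation of two equal-length score lists."""
    zx, zy = z_scores(x), z_scores(y)
    return mean(a * b for a, b in zip(zx, zy))

# Hypothetical raw scores for five Ss on three scales (illustration only).
raw = {
    "I":   [70, 85, 92, 78, 64],
    "Tc":  [12, 17, 19, 14, 15],
    "SAV": [18, 21, 23, 19, 16],
}
z = {scale: z_scores(scores) for scale, scores in raw.items()}
n_subjects = len(raw["I"])

# S:AOS  - average of the standard scores over all available scales
s_aos = [mean(z[s][i] for s in z) for i in range(n_subjects)]
# S:I-Tc - standard scores of I and Tc combined (summing is an assumption)
s_i_tc = [z["I"][i] + z["Tc"][i] for i in range(n_subjects)]
# R:I-Tc - raw scores of I and Tc combined (summing is an assumption)
r_i_tc = [raw["I"][i] + raw["Tc"][i] for i in range(n_subjects)]
# R:I    - the raw I scale alone
r_i = raw["I"]

print(round(pearson(r_i, r_i_tc), 3), round(pearson(s_aos, s_i_tc), 3))
```

Because the I scale has far more items than the Tc scale, a raw sum is dominated by I, while a sum of standard scores weights the two scales equally; comparing the two composites is what the R:I-Tc and S:I-Tc scales were built to do.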

Results and conclusion. Table 1 contains the product moment 
correlation coefficients obtained from a comparison of each of the 
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above-mentioned scales with each other, as well as the respective 
correlations between each of these scales and the remaining POI 
scales. 

An examination of this table shows that the potentially overall 
POI scales of S:AOS, S:I-Tc, R:I-Tc, and R:I yielded inter-scale 
correlation coefficients of from .87 to .98, which are considerably 
high in terms of predictability from one scale to another. The aver- 
age of the coefficients between any given scale and the other three 
overall scales is highest for the R:I-Tc scale, being .97, and the 
lowest for the R:I scale, being .93. This, however, is not a sufficiently 
large discrepancy to be of any practical significance. Also, by in- 
spection there seems to be little practical difference in terms of the 
comparative size of the coefficients between any given overall scale 
and the remaining 11 POI scales. The average of these for each of 
the overall scales is .64 for the S:AOS, .59 for the S:I-Tc, .62 for 
the R:I-Tc, and .60 for the R:I. 

All of this points to the conclusion that an overall measure of the 
POI can probably be best obtained by using the raw score of the I 
scale, or by combining the raw scores of the I and Tc scales. No 
significant increase in predictability is obtained by converting raw 
score data to standard scores for combining scales. 
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RELATIONSHIPS BETWEEN A MEASURE OF 
"SENSATION-SEEKING" AND PERSONAL PREFERENCE 
SCHEDULE NEED SCALES 


CARRIE WHERRY WATERS 


Center for Psychological Services 
Ohio University 


AND 
L. K. WATERS 
Ohio University 


THE Sensation Seeking Scale (SSS; Zuckerman, Kolin, Price, 
and Zoob, 1964) was developed in an attempt to quantify the con- 
cept of optimal level of stimulation. Several studies have explored 
the construct validity of the SSS in terms of its relationships to 
other personal characteristics scales, volunteering for “unusual” 
experiments, and performance in a simulated gambling game. 
(Zuckerman and Link, 1968, have summarized much of the relevant 
literature). 

Some of the “needs” purportedly measured by the Personal Pref- 
erence Schedule (PPS; Edwards, 1959) seem conceptually related 
to “sensation seeking,” and Zuckerman and Link (1968) reported 
substantial correlations between the SSS and several PPS scales 
for a small sample (n = 40) of male Ss. The purposes of the present 
study were (a) to explore the relationship between the SSS and 
PPS scales for both males and females and (b) to determine the 
consistency of the relationships across two samples of each sex. 

Method. Two samples of introductory psychology students were 
administered the SSS and PPS during regular class sessions. Only the 
22 items common to the male and female keys of the SSS were 
given. The first sample consisted of 122 females and 120 males; 
the second sample, tested about one year later, consisted of 76 fe- 
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males and 69 males. For both samples, the SSS was administered 
first and about two weeks separated the SSS and PPS administra- 
tions. 

Results and discussion. Table 1 gives the correlations between 
the SSS and PPS scales for males and females in each sample. The 
coefficients, in general, are considerably lower than those reported 
by Zuckerman and Link (1968). The two week separation of the 
administrations in the present study may have been a factor in the 
differences in magnitude of the coefficients in the two studies. For 
the males, the only significant relationships between the SSS and 
PPS scales consistent across both samples were for PPS Change 
(+) and PPS Order (—). The female samples showed several 
more significant correlations than the male samples. Significant 
positive relationships were found in both female samples for PPS 
scales: Change, Autonomy, Exhibitionism, Dominance, and Aggres- 
sion. Consistent negative correlations were obtained for Order, 
Succorance, and Deference. 

From the PPS scale descriptions, the two scales that would be 
expected to be most highly related to the SSS were the ones that 
showed the largest correlations: Change (+) and Order (—). The 


TABLE 1 
Correlations between the SSS and PPS Scales 
Females Males 
Sample I Sample II Sample I Sample II 
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additional scales that correlated positively and negatively with the 
SSS in the female samples were also found to correlate significantly 
and in the same direction by Zuckerman and Link (1968). Al- 
though the correlations obtained in the present study were of quite 
modest magnitudes, the pattern of relationships was consistent with 
conceptions of the sensation-seeker as one who “is independent, 
unconventional, and low in social values and conformity, needs 
variety, and does not value order or routine” (Zuckerman and 
Link, 1968, p. 424). Differences in the number of significant rela- 
tionships for Zuckerman and Link’s males and those in the present 
study, and the differences between males and females reported 
here, need more exploration. 
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OCCUPATIONAL INTEREST INVENTORY "FIELDS OF 
INTERESTS" SCORES AND MAJOR FIELD OF STUDY 


ROBERT F. STAHMANN 
University of Iowa 


Are Occupational Interest Inventory (OII) Fields of Interests 
scores valid for predicting university major field of study? The in- 
terpretation of the OII implies this validity (Lee and Thorpe, 
1956); and because evidence for it is lacking, the present study ex- 
plored such predictive validity. 

Dunn (1962) found that students who had succeeded in various 
fields of study had correspondingly different OII total profiles as 
freshmen. In another study, Stahmann and Wallen (1966) found 
that OII total profiles, in combination with achievement test scores 
obtained from freshman entrance testing, effectively predicted major 
field of study at graduation. However, no study was found report- 
ing the predictive validity of OII Fields of Interests scores alone. 

The Occupational Interest Inventory reports interest in six fields: 
Personal-Social, Natural, Mechanical, Business, The Arts, and The 
Sciences; in three types: Verbal, Manipulative, and Computational; 
and a Level of Interests score. The test is often used in counseling 
situations requiring an instrument delineating broad vocational cate- 
gories (Goldman, 1961), although the Manual suggests a system for 
describing more specific occupational pursuits based upon the Fields 
of Interests scores (Lee and Thorpe, 1956). 

Purpose. The purposes of this study were (a) to compare the 
like scores of criterion and cross-validation samples of male and 
female students upon the six Fields of Interests variables obtained 
from a university entrance battery and (b) to ascertain their va- 
lidities for predicting major field of study at graduation. All data 
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were obtained from the Advanced Level of the 1956 revision of the 
OII. 

Procedure—Samples. University of Utah graduates of 1962 
through 1966, in selected fields of study, served as the population 
from which the samples were drawn for the study. The students had 
taken the OII as part of the freshman entrance examination battery 
before beginning university work. The design required two samples 
for each field of study, a criterion (experimental) sample upon 
which the predictions were derived, and a cross-validation sample 
for which the predictions as to major field of study were made. 

The numbers of graduates for whom freshman OII data were 
available varied among the fields of study. Sample sizes of 50 were 
desired for the criterion samples and were obtained for all of the 
men’s fields except pharmacy in which a sample size of 25 was ob- 
tained. In the women’s major fields criterion sample sizes of 45 
each were obtained for nursing and letters and science while an N 
of 50 was obtained for elementary education. (See Tables 1 and 2 
for the Ns of the cross-validation samples in each field of study.) 

Statistical analyses. Multiple discriminant analysis was the sta- 
tistical technique used. The functions obtained from the discrim- 
inant analysis of the criterion samples were applied to the cross- 
validation samples for predictive purposes. The analyses were run 
separately for men and women students. Analyses in the study 
were facilitated by the use of a standard discriminant analysis com- 
puter program and computed on an IBM 7044 computer at the 
University of Iowa Computer Center. 


fferences (.05 level) 
ation samples on five of the six 
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which his discrimination score was the highest. The results of the 
predictions for the male cross-validation samples are summarized 
in Table 1. 

The numbers of correct predictions obtained follow the diagonal, 
and the percentages of correct predictions are shown as “hits” in 
the right hand column of Table 1. The numbers of correct predic- 
tions expected by chance, based on marginal totals, are shown as the 
numbers in parentheses in Table 1. The frequencies arising from 
predictions for the male cross-validation samples based upon the 
weights of the male criterion samples exceed the numbers expected 
to be correctly predicted by chance for engineering, business, letters 
and science, and secondary education fields of study. The frequency 
in the prediction for the pharmacy field is at chance expectation. 
The small sample sizes in the pharmacy field may have accounted 
for some of the inaccuracy in prediction. 

The data from the female criterion samples were submitted to the 
multiple discriminant analysis and the discriminant weights which 
were obtained were applied to the predictions for the female cross- 
validation samples. The results of the predictions for the female 
cross-validation samples based upon weights of the female criterion 
samples are shown in Table 2. The predictions as to major field of 
study, based on OII Fields of Interests variables, exceed the num- 




TABLE 1 


Agreement between Predicted and Actual Field of Study, Male Cross-Validation 
Samples, OII Fields of Interests Data 


Predicted Major Field 
Actual Major Engineer- Busi- Phar- Letters & Secondary % 
Field ing ness macy Science Education Total Hits 


4 50. 60 
10 
gi : "i 49 57 
9 
th (9) d n 
(3) (3) 
13 48 27 
(8) (9) 
23 50 46 
( (10) 
62 57 20 34 41 214 


Note.—Top numbers refer to predictions made by the discriminant analysis. Numbers in 
parentheses are the number of predictions expected by chance, based on marginal totals. 
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TABLE 2 


Agreement between Predicted and Actual Field of Study, Female 
Cross-Validation Samples, OII Fields of Interests Data 


                         Predicted Major Field 
Actual Major                Elementary   Letters &              % 
Field            Nursing    Education    Science      Total    Hits 
Nursing            27           7            3          37      73 
                  (12)        (17)          (8) 
Elementary          5          34           11          50      68 
Education         (16)        (22)         (11) 
Letters &           9          15           14          38      37 
Science           (12)        (17)          (9) 
Total              41          56           28         125 


Note.—Top numbers refer to predictions made by the discriminant analysis. Numbers in 
parentheses are the number of predictions expected by chance, based on marginal totals. 


ber expected by chance in nursing, elementary education, and letters 
and science fields of study. 
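
The chance expectations in parentheses in Table 2 follow from the marginal totals: for each cell, expected count = (row total x column total) / N. The sketch below recomputes the diagonal expectations and the percentages of hits from the observed counts in Table 2; the variable names are illustrative assumptions and nothing here is taken from the author's own program.

```python
# Observed counts from Table 2: rows = actual field, columns = predicted field.
fields = ["Nursing", "Elementary Education", "Letters & Science"]
observed = [
    [27, 7, 3],    # actual Nursing
    [5, 34, 11],   # actual Elementary Education
    [9, 15, 14],   # actual Letters & Science
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under chance for every cell, from the marginal totals.
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

# Percentage of hits: correct predictions (diagonal) over each row total.
hits = [round(100 * observed[i][i] / row_totals[i]) for i in range(len(fields))]

for i, name in enumerate(fields):
    print(name, observed[i][i], round(expected[i][i]), hits[i])
```

Every diagonal count exceeds its chance expectation, which is the comparison the text draws for the three women's fields.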
Discussion. The results of this study showed that OII Fields of 
Interests scores obtained from university entrance testing do have 
validity for predicting selected major fields of study at graduation. 
Interestingly, the percentages of “hits” obtained in this study using 
OII Fields of Interests scores only are essentially the same as the 
percentages of “hits” obtained using OII Fields of Interests, Types 
of Interests, and Level of Interests scores for the same samples of 
students (Stahmann, 1969). Obvious sampling limitations prohibit 
us from generalizing these findings, but point to further research 
using OII variables in prediction problems of this type. 
A practical implication arising from this and other studies utilizing 
multiple discriminant analysis is curr 
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THE VALIDITY OF THE CJVS SCALE OF 
EMPLOYABILITY FOR OLDER CLIENTS IN A 
VOCATIONAL ADJUSTMENT WORKSHOP 


ASHER SOLOFF AND BRIAN F. BOLTON 
Chicago Jewish Vocational Service 


THE Chicago Jewish Vocational Service (CJVS) Scale of Em- 
ployability was developed and refined during the years 1957-63 
with support from the Social and Rehabilitation Service. The 
Scale is used to assess the potential employability of disabled per- 
sons. Three scales comprise the Scale of Employability: the Psy- 
chology, Counseling, and Workshop Scales. No effort has been 
made to combine the three into a single score. The original Work- 
shop Scale contains 48 items pertaining to client behavior in a vo- 
cational adjustment workshop. Each item requires a rating on a 
four-point scale to be made by a counselor-foreman. The scales and 
the results of the 5-year study were published as CJVS Mono- 
graph No. 4 (Gellman, Stern, and Soloff, 1963). 

The vocational adjustment workshop is a technique for rehabili- 
tating handicapped persons for competitive employment. The at- 
mosphere is usually similar to that of a small factory: a variety of 
assembly work is subcontracted and clients are paid wages for their 
production. The workshop “foremen” are trained to function in a 
manner designed simultaneously to test and improve clients’ ca- 
pacities to perform at work. The philosophy and techniques of the 
vocational adjustment workshop are presented in Gellman and 
Friedman (1964). 

Purpose. The general purpose of this research was to validate a 
short form of the Workshop Scale of the Scale of Employability with 
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a sample of older disabled clients. Specifically, the research was 
focused on two concerns: the assessment of change in level of func- 
tioning during workshop tenure and the prediction of eventual out- 
comes. 

Sample. The subjects of this study were all clients in an Older Worker
Project. The goal of the three-year project was to increase the
employability of disabled persons over 45 years of age who possessed an
employment handicap and were judged to be potentially employable. A
majority of the clients were over 60 years of age and had some
limitation in physical capacity.

The clients who were assigned to the workshop program were 
considered to be in need of vocational evaluation and adjustment 
services. (Two-thirds of the project clients were referred directly
for counseling and job placement.) A total of 146 clients were as- 
signed to the workshop program during the project. Only the 64 
clients who remained in the workshop for 17 weeks (full-length 
program) were included in this research sample. This selection pro- 
vided a sample with a larger proportion of the more disabled 
clients. (The success rate was slightly over 40 per cent as compared 
with a success rate of 50 per cent for the total sample of 125 clients 
who spent some time in the workshop.) 

Instrument. The original Workshop Scale was reduced to 25 items via an
informal item analysis procedure. Five items were selected to assess
performance in each of four subscales and four items in a ...
(Gellman, et al., 1963.)

I. Ability to Mobilize and Direct Energy in Work Situation (1, ...)
II. Capacity to Tolerate and Cope with Work Pressures, Tension, and
Demands (11, 13, 14, 15, 17)
III. Interpersonal Relations with Foreman (25, 26, 27, 29, 32)
IV. Level of Ability in Work Situation (35, 36, 37, ...)
V. Relations with Co-Workers (18, 22, 23, 24)
Single item dealing with overall functioning (28)


The subscale intercorrelations are moderately high (all positive).
Estimates of two types of reliability for the total scale score are
available: internal consistency is .80 or higher and inter-rater
reliability is .70. The inter-rater reliability is important for the
Workshop Scale because foreman-client assignments are rotated
periodically.
Methodology. The total sample of 64 clients was arbitrarily divided
into two groups: those clients who entered the shop before January 1,
1967, and those clients who entered after this date. The two groups
were created because it was the impression of the counselors and
foremen that the average client referred later in the project was more
handicapped and less employable. (The data support this impression.)
Also, it was felt that stronger validation evidence would be obtained
with two samples.

The basic design of the research is a repeated measures design
(rating), with ratings taking place at the end of 2, 4, 8, 13, and 17
weeks in the workshop. The first four-week period constitutes the
diagnostic phase and the last 13 weeks the adjustment phase of the
service. Data were analyzed using the standard ANOVA and correlational
techniques. The outcome criterion directly represents the goal of
vocational rehabilitation, job placement. Clients who were placed in
employment were considered successful; the others were considered not
successful.
Results and discussion. The basic descriptive data on the Workshop
Scale for the two groups of full-term clients are presented in Table 1.
Each scale score (in the table) is an average for five separate
ratings. It is readily apparent that the second sample was more
heterogeneous than the first. Nevertheless, the pattern of subscale
scores is remarkably consistent for successful and unsuccessful groups
of clients as well as for the group totals. All clients, successful and
unsuccessful, were rated highest on Subscale II (Capacity to Tolerate
and Cope with Work Demands). Subscale IV (Func... Relations with
Co-... ratings for all clients; but Sub... became successfully
employed. The second best discriminator between successful and
unsuccessful clients was Subscale I (Ability to Mobilize and Direct
Energy), and this, too, is an expected finding. The other subscale
differences for the second sample may be due to the presence of some
very depressed clients.

TABLE 1
Means and Standard Deviations of the Workshop Scale of Employability
for Two Samples of Older Disabled Clients*

                         Sample I                     Sample II
                          Suc-    Unsuc-               Suc-    Unsuc-
                 Total   cessful  cessful     Total   cessful  cessful
N                  38      15       23          26      11       15
Total Scale
  Mean            61.4    63.2     60.2        56.4    61.0     53.0
  SD              11.0    10.0     12.0        18.1    17.4     15.0
Subscales (Means)
  I               63.8    66.4     62.2        57.9    62.6     54.4
  II               ...     ...     72.8        66.3    66.7     66.1
  III             65.4    66.0     65.0        61.1    64.7     58.5
  IV              51.4    56.9     47.8        48.2    56.3     42.4
  V               53.0    51.7     53.9        48.0    53.3     45.2

* All means are averages of ratings at five points in time.

The ANOVA and related correlational data are presented in Table 2. As
would be expected, validity coefficients are higher for the more
heterogeneous second sample. The validity coefficients (point biserial
correlations) based on the average of the two ratings completed during
the diagnostic phase were .171 and .330 for the first and second
samples, respectively. Predictability improved with more time to
observe: the validity coefficients based on the average of five
ratings were .287 and .502.
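
The validity coefficients reported here are point biserial
correlations between a continuous score (the averaged ratings) and a
dichotomous outcome (placed vs. not placed). As a minimal sketch of
that computation, assuming the conventional formula with the
population standard deviation, the following Python function may be
helpful. The function name and the example data are hypothetical and
are not taken from the study.

```python
from math import sqrt
from statistics import mean, pstdev


def point_biserial(scores, outcomes):
    """Point-biserial correlation between scores and a 0/1 outcome.

    Uses r = (M1 - M0) / s * sqrt(p * q), where s is the population
    standard deviation of all scores and p is the proportion coded 1.
    """
    if len(scores) != len(outcomes):
        raise ValueError("scores and outcomes must have the same length")
    ones = [s for s, o in zip(scores, outcomes) if o == 1]
    zeros = [s for s, o in zip(scores, outcomes) if o == 0]
    if not ones or not zeros:
        raise ValueError("both outcome groups must be non-empty")
    sd = pstdev(scores)
    if sd == 0:
        raise ValueError("scores must not be constant")
    p = len(ones) / len(scores)
    return (mean(ones) - mean(zeros)) / sd * sqrt(p * (1 - p))


# Hypothetical averaged ratings and outcomes (1 = placed in employment).
example_scores = [66.0, 58.5, 71.2, 49.0, 62.3, 55.1]
example_outcomes = [1, 0, 1, 0, 1, 0]
print(round(point_biserial(example_scores, example_outcomes), 3))
```

The result is numerically identical to the ordinary Pearson
correlation computed with the outcome coded 0 and 1.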

The attempts to assess the changes that take place in clients during
the workshop experience were not fruitful. Neither differences among
the time periods nor the interaction of groups and trials produced
significant F ratios. We conjecture that the difficulty is due to the
apparent erratic or inconsistent performance of the older workers over
time. From an examination of individual client trend lines (not
presented here) in conjunction with the Table 2 data, we concluded
that the average intra-individual variability exceeds that which could
have been expected from the unreliability ... repeated measures
procedure is not an ... these data (the intra-individual variability
... term for the trials and trials by groups tests of significance).
Following Berdie's (1969) suggestion, the variability over trials was
computed for each client and correlated with total score and outcome.
No consistent relationships were found. We are not able to suggest a
statistical procedure for analyzing change over time under the
conditions of such great intra-individual variability.
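
The follow-up analysis described above (a variability index per
client, then its correlation with total score and with outcome) can be
sketched as below. This is only an illustration: the source does not
say which index of variability was used, so the sample standard
deviation is an assumption, and the function names and data are
hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def within_client_sd(ratings_by_client):
    """Sample standard deviation of each client's ratings over trials."""
    return [stdev(ratings) for ratings in ratings_by_client]


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / sqrt(sxx * syy)


# Hypothetical data: one row per client, one column per rating occasion.
ratings = [
    [60, 62, 61, 63, 62],
    [40, 70, 45, 75, 50],
    [55, 50, 65, 45, 60],
    [58, 59, 57, 60, 58],
]
outcomes = [1, 0, 0, 1]  # hypothetical coding: 1 = placed in employment

variability = within_client_sd(ratings)
totals = [mean(r) for r in ratings]
print(pearson(variability, outcomes), pearson(variability, totals))
```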

While the tentative conclusion of inconsistency of individual per- 
formance is troublesome from a statistical point of view, it repre- 
sents an important substantive finding. This finding suggests that 
further research on the sources of such variability (e.g., personal 
variability vs. variability attributable to participation in a
long-range program) is important for learning about the relationship
between taking part in rehabilitation programs and subsequent vo- 
cational success for older workers. 
THE MINNESOTA TEACHER ATTITUDE INVENTORY
AND COUNSELOR-CAMPER INTERACTION: A NOTE ON 
PREDICTIVE VALIDITY 


GENE F. SUMMERS 
University of Illinois 
ARNOLD A. SHUSTER AND SUSAN K. SHUSTER
Indiana University 


The Minnesota Teacher Attitude Inventory (MTAI) is purportedly a
measure of teachers’ attitudes which allows the prediction of future
teacher-pupil interactions. Its development rests on the assumption
that the attitudes of a teacher are the key to predicting the type of
classroom atmosphere he will be able to maintain (Cook, Leeds, and
Callis, 1951). The validity of the MTAI has been examined in a number
of studies with generally affirmative results (Getzels and Jackson,
1963).

The study reported here examines the predictive validity of the 
MTAI in counselor-camper interactions. The rationale for our ex- 
pectation that the MTAI will have predictive validity is based 
upon the similarity of teachers and counselors in tasks, age-status 
differences, goals of interaction (i.e., learning of skills, attitudes,
values), role structure, and role relationships. One of the most
important characteristics of an effective counselor is his ability to
establish and maintain harmonious adult-child relationships. The 
type of camp atmosphere is as important to the achievement of 
camp objectives as is the classroom atmosphere to school objectives. 
Thus, the rationale for the predictive validity of the MTAI in 
teacher-pupil interactions is equally applicable to counselor-camper 
interactions. In order to examine the predictive validity of the
MTAI for counselor-camper interactions, data from three day camps
were collected.
Procedure. The three camps were conducted during the Spring,
Summer, and Fall of 1967. 

Counselors for the Spring and Fall Camps were undergraduate 
students enrolled in a course in recreation and park administration. 
Twenty-two counselors in the Spring session met for four consecu- 
tive Saturday mornings with 100 children ages eight to twelve; 
twenty-eight counselors in the Fall Camp met for six consecutive 
Saturdays with 54 children ages eight to twelve. 

The Summer Camp was a privately operated day camp. Fourteen 
young adults of varying backgrounds and eight high school students 
served as counselors for 100 children ages six to thirteen who
attended camp for an eight-week period (five half-days per week).

In addition to counselors’ MTAI responses, several external criterion
measures were obtained: (a) the Observers’ Ratings of Leadership
Style, (b) the Camp Director’s Ratings of counselors’ performance, and
(c) Campers’ Satisfaction Ratings.

The Observers’ Ratings consisted of fifteen statements descriptive 
of leadership style. These were divided into three sets of five items; 
each set of items was written to describe either democratic, laissez 
faire, or authoritarian leadership styles. While the items have face 
validity, more rigorous tests of validity and reliability have not 
been completed. Typical of the items in the sets are those below. 

Democratic:     Whenever special advice was needed, the leader
                would offer several alternatives for a group
                decision.

Laissez faire:  The leader seemed to supply help only when asked.

Authoritarian:  Practically all policies affecting the group
                activity and procedure were determined by the
                leader.

In all three camps, each counselor was observed on three separate
occasions by three different observers who were unaware of the
counselors’ MTAI scores. The Observers’ ... the end of a ten minute
... zero. This means that scores on the three dimensions would not be
independent. The counselor’s total score for each of the three
leadership styles was calculated by summing the ratings of all
Observers. Inter-rater reliabilities were satisfactory, although they
were somewhat low for the laissez faire score (see Table 1).

The Director’s Rating Scale was a modified version of Leeds’
original principals’ rating scale (Leeds, 1946). The director of the
Fall Camp rated each of his counselors on seven dimensions. The
five response categories were weighted -2 to +2 and the counselor’s
score was the sum of the chosen responses. Directors, like the
observers, were unaware of counselors’ MTAI scores.
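
The scoring rule for the Director’s Rating Scale (five categories
weighted -2 to +2, summed over seven dimensions) can be illustrated
with a short sketch. The coding of response categories as the integers
1 through 5 and the function name are assumptions made for the
example; the source does not describe how categories were labeled.

```python
def director_score(responses, n_dimensions=7):
    """Sum of category weights for one counselor.

    Each response is a category number from 1 to 5 (an assumed coding);
    category k carries the weight k - 3, so weights run from -2 to +2.
    """
    if len(responses) != n_dimensions:
        raise ValueError(f"expected {n_dimensions} responses")
    total = 0
    for category in responses:
        if category not in (1, 2, 3, 4, 5):
            raise ValueError("each response must be a category from 1 to 5")
        total += category - 3
    return total


# Hypothetical ratings of one counselor on the seven dimensions.
print(director_score([4, 5, 3, 4, 2, 5, 4]))
```

With seven dimensions, scores under this rule can range from -14 to
+14.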

Campers’ Satisfaction Ratings were obtained during the last day
of the Fall Camp when each camper was interviewed by his Unit
Leader. Unit Leaders were student supervisors; each had five
counselors under his supervision. They had no knowledge of
counselors’ MTAI scores.

The Unit Leader suggested that the camper pretend there was to 
be another camp in the spring. Seven questions were asked which 
linked satisfaction with the fall camp to the imaginary spring camp. 
Three dealt with the counselor; two were concerned with group 
achievement; and two involved satisfaction with the group. Each 
item was scored on a favorable-unfavorable dimension and scores 
were summations of arbitrary weights of 0 and 1 (favorable) for 
each item. The following are examples of the questions to which
campers responded:


Would you like to be with new kids or the same kids? 

Would you like to do the same kinds of things as you did at this 
camp or different things? 

Would you like the same counselor or someone else? 


Results—Spring camp. The counselors’ MTAI responses were 
obtained two weeks prior to the beginning of camp which ran four 
weeks. Observations of counselor-camper interactions were made 
throughout the four weeks of camp. Thus, the time between 
counselors’ responses to the MTAI and our criterion measure ranged 
from two to six weeks with an average lag of four weeks. The cor- 
relations of MTAI and observation scores are presented in the first 
column of Table 1. 

It is clear that our expectation of relatively high coefficients of
predictive validity has been met. All are statistically significant,
although the coefficient for the MTAI vs. laissez faire correlation
is lower than the others. This is not surprising in view of the fact
that the MTAI was not developed to tap this dimension of leadership
style.

In a majority of previous studies reporting on the predictive va- 
lidity of the MTAI the coefficients were based upon concurrent 
comparisons. The data from the Spring Camp enable us to over- 
come the limitations of concurrent comparisons to a minor degree. 
There was a two-to-six weeks time lapse between measures. How- 
ever, this still leaves the rather strong possibility that the correla- 
tions obtained are spuriously high. The factors which Yee (1967) 
suggested operate to undermine the predictive validity of the MTAI 
may not have had sufficient time to become manifest. Since our con- 
cern was with predictive validity rather than with concurrent va- 
lidity we attempted to maximize, insofar as was possible, the time 
between measures during the Summer Camp. 

Summer camp. The counselors responded to the MTAI one week 
before the beginning of the camp. The camp season consisted of five 
half-day sessions for eight consecutive weeks. The Observers’ 
Checklist scores were based upon observations made during the final 
month of camp, and this resulted in a time lag between the admin- 
istration of the MTAI and the criterion measures of five to nine 
weeks with an average lag of seven weeks. The extended time also 
meant that counselor-camper interactions had greater opportunity 
to become stabilized before observations were made. As can be seen 
by comparing the results in Table 1, essentially the same pattern of 
correlations is maintained in spite of the longer time lapse between 
measures in the summer camp. 

Fall camp. The MTAI was administered to the prospective coun- 
selors five weeks in advance of the commencement of camp. The 
observers’ ratings were obtained during each of the six weekly camp 
sessions. The time lag between the measurement of the MTAI and
the criterion variables of Observers’ Checklist scores ranged from
five to eleven weeks with an average lag of eight weeks. The time lag
between the MTAI and the three additional variables was eleven 
weeks, since they were collected during the final session of camp. 

As in the previous two camps, the relation of MTAI to observed
styles of leadership was quite substantial. Again the laissez faire 
dimension showed little relation to MTAI. 
One of the common criterion measures of the MTAI used in previ-
ous studies has been supervisors’ or principals’ ratings. The ability 
to predict such ratings from the MTAI has been acceptably high. In 
our study the correlation between MTAI responses and Camp Di- 
rector’s rating of the counselors was .75. This coefficient was as high 
as those of nearly all previous studies and higher than most. It was 
also comparable to the MTAI vs. observed leadership style. 

Another commonly used criterion of predictive validity has been
students’ ratings of their teachers. Our parallel measures (Campers’
Satisfaction Ratings) correlate with counselors’ MTAI responses
.61, .41, and .46 for group achievement, like vs. dislike of counselor,
and satisfaction with group, respectively. Again the coefficients
obtained were of a magnitude comparable to those of previously
reported studies.

In summary, the MTAI responses were found to be significantly
related to observations of counselors’ democratic and authoritarian
leadership styles in all three camps. In the third camp significant
relations were found between MTAI and Camp Director’s ratings
of counselors’ performance and campers’ satisfaction measures.
These findings lead us to conclude that the MTAI does have
predictive validity in the counselor-camper interaction situation as
well as in teacher-pupil interactions.
Robert H. Bauernfeind. Building a School Testing Program (1969 ...).
Boston: Houghton Mifflin Company. Pp. xvii + ... $5.95.

Can a publication that has a different colored cover be called
“new”? The following statement appears on page iv of the book:


1969 IMPRESSION 


This printing of “Building a School Testing Program” includes
some revised materials developed in 1968. Three new high
school tests of general educational development have been
listed in Chapter 8; references to Buros’ Mental Measurements
Yearbooks have been revised to show listings appearing in the
Sixth Yearbook (1965); and test publishers’ addresses
cited on pages 14-15 have been checked for current accuracy
with ZIP CODE numbers included.


The changes that are evident in the 1969 book other than those
listed above are so minute that the 1963 pagination can be used.
The basic changes that were evident in the book are (1) different
colored cover; (2) different listing of test publishers; (3) listing
of several new tests; (4) revision of the exhibit on page 272; (5)
listing of the entry number for the Buros’ Sixth Yearbook rather
than the Fifth Yearbook. Since the book has not been rewritten
nor extensively revised, whatever opinions were formulated
towards it in 1963 can be safely held in 1969.
In regard to the Buros’ entry number, the reviewer was surprised
to find that the brief comments that appear about the various tests
do not reflect a reasonable précis of the reviews found in the Sixth
Yearbook. From the layout of the page, the natural assumption is that
the overview of the test is an abstract from the Sixth Yearbook. Most
of the students in a test and measurement class made the same
assumption. The presumption should not have been made, for the test
overview is the same for both issues of the book.
It is the opinion of the reviewer that additional basic changes
should have been made. Although the discussion on reliability
and validity is good, it should have been structured within the
1966 Standards for Educational and Psychological Tests and Manuals.
A rather serious omission is that neither this publication nor the
1954 Technical Recommendations for Psychological Tests and Diagnostic
Techniques is mentioned in the book. Since the purpose of the book is
to “take school people on a conducted tour of the hundreds and
hundreds of published tests available to them and to provide
suggestions and general guidelines to help them plan their school
testing programs,” the failure to cite the Standards for Educational
and Psychological Tests and Manuals deprives school personnel of a
valuable guide that would enable them to make meaningful test
reviews.

Secondly, Chapter 15, entitled “What’s ahead in Educational
Testing? Projections by Nine Test Specialists,” has not been
revised. These opinions were formulated in 1961-62. Surely, the
projections for the 1970's are somewhat different from those
formulated for the 1960's. For example, how is the demand to assess
con... going to affect the use of typical intelligence ...

... for advice; (3) uses only one school district to demonstrate how
to develop an extensive test program; (4) too much stress on how
to record a score rather than o... is placed under new publications?


HENRY KACZKOWSKI
University of Illinois
of research methods. Unfortunately, the execution of this project
is marred by such serious flaws as to result in a caricature of “how 
psychologists do research.” 

The fourteen chapters include presentations of Freud’s study 
of Little Hans, Thurstone’s and Willoughby’s neuroticism ques- 
tionnaires, Gurin's survey research, Pavlov's and Miller's experi- 
ments on conflict, Mowrer's and Solomon's work on avoidance, 
and Wolpe's conditioning therapy. Alternate chapters present
methodological exposition and evaluations of the selected studies.

Since the author explicitly disclaims an attempt at comprehen- 
siveness in presenting research methods, the status of anxiety 
research, or methodological criticism, his particular selection of 
cases could have served well his purpose of providing illustrative 
research on a common topic. However, the choice of the case 
method demands a sensitivity to the significance of case material 
which is totally lacking. Thus, for example, the significance of 
Freud and Pavlov as scientists—their courage, insight, dedication, 
and integrity—is altogether missed, and replaced with pointless 
frivolity (e.g., Freud was “undoubtedly puffing on a cigar” as he 
interviewed Hans.). 

Appearing at a time when serious scientific judgment is unanimous
in rejecting narrow “experimental” methods as the primary source of
knowledge, this text is surprisingly dated in tone and content.
Methodological sections are written as if the voluminous and
searching criticism of psychological research methods and the
development of alternative methods reflecting a Brunswikian
conception of psychological inquiry had ... “objectivity” and
“control” virtually preempt any consideration of psychological
problems of anxiety, and lead to a curious set of priorities in
evaluating contributions to psychological inquiry (e.g., Wolpe over
Freud, Berkun over Pavlov). Such critical issues as discontinuities
between human and animal “anxiety,” experimenter bias, or ethical
responsibilities in research are mentioned briefly, but never
seriously dealt with.

In the concluding chapter, the author sketches as “one of many
that need to be done” a hypothetical study in which the relative
effectiveness of psychoanalysis and conditioning therapy in treating
neurosis is to be tested by (a) administering unspecified
post-hypnotic guilt suggestions to 40 normals, (b) having graduate
students administer unspecified brief equivalents of psychoanalysis
and reciprocal inhibition therapy, and (c) measuring outcomes
by decrements in self-ratings of “nervousness” on a single-item
four-point scale. While one may be touched by the author's
innocence of such studies (and devastating criticism) in the
literature of the 1950's, or impressed by his chutzpah in offering
this exemplary method, it is difficult to see that students would
be enlightened by such an introduction to psychological research.

While sections of the book contain useful presentations of de- 
tails of research methods, and the general format of the book is 
attractive and readable, the author’s pedestrian, uncritical (and 
often misleading) treatment of substantive issues, and the “talking- 
down” style of exposition are likely to irritate bright students 
and mislead the not-so-bright. Until a supplementary text appears 
which lives up to the undelivered promise of Dustin’s book, most 
instructors will prefer to enrich their introductory courses with 


such standard supplements as Bachrach’s, Miller’s, or Scientific 
American reprints. 


RAE CARLTON 
Educational Testing Service 


William J. Gephart and Robert B. Ingle (Eds.). Educational Re- 
search: Selected Readings. Columbus, Ohio: Charles E. Mer- 
rill Publishing Co., 1969. Pp. x + 454. $7.95. 

Most one-... research must ... “consumers” rather ... apprenticeship,
in the opinion of ... researchers, is required to turn students into
... educational research should be. Consider three objectives:

1. Courses and books must stimulate and ... relatively
   unsophisticated students in a subject that is rigorous and
   abstract enough to be difficult.
2. Realistic expectations of educational research should be
   fostered.
3. Students should ... educational research with critical ...

In my opinion, the collection of readings on educational research
compiled by Gephart and Ingle can contribute substantially to the
achievement of the foregoing objectives. The 44 articles in the
book are generally well written and brief (the mean length of the
articles is 9 pages; the longest one is 22 pages). They contain very
little technical jargon or mathematical symbolism, so the relatively
unsophisticated but reasonably intelligent student who has been
introduced to only the most basic statistical concepts should be
well able to read with comprehension. A feature of many articles
which enhances their comprehensibility and appeal is the extent
to which propositions and concepts are elucidated by examples
drawn from the experience of the authors. These qualities of the
readings (which one might have expected, knowing that 15 of
the papers were drawn from Phi Delta Kappan and The American
Psychologist) should help to maintain student interest in the
material treated in the book.

The book should fare well too in developing reasonable expectations
of educational research activities. The nature and purpose of
research are topics that recur throughout the book. And the process
of research from identifying problems to framing conclusions is
described realistically in terms of the demanding work it is. Perhaps
more important for achieving the second objective is the fact that a
major difficulty in research, how to achieve broadly generalizable
results, is described in several articles. After reading the book
with understanding, no student should carry away the impression that
education can be revolutionized by the results of a few months’
research. On the other hand, the reader should be convinced that an
important means to the improvement of education is through
application of the method of science to educational problems. And
because some readings have been chosen deliberately to present
opinions at variance with those expressed in other readings, the
perceptive student will quickly be... help students to picture
scientific research as a ... process, and rightly so. The
controversies should make the readings more interesting besides. (It
is possible, of course, that controversy will only confuse ... to
take pains to ensure that this does not occur.)
Achievement of the third objective, to develop the ability to read
(“consume”) literature critically, is at least supported by articles
outlining criteria for evaluating each of several aspects of a piece
of research: the problem, the hypothesis (or hypotheses, in the happy
event that Platt’s “strong inference” ... was employed), the design,
the statistics, and the conclusion. I say that the achievement of
this third objective is “at least supported” by the book because a
student would require much more knowledge than what is given in the
readings themselves to write an adequate critique of a report.
However, if the reader of the Gephart-Ingle collection already has
some depth of knowledge in a substantive area (including information
about the observational techniques and the methods of data analysis
used in the area), then the readings do provide guidelines for
applying this knowledge in a critical examination of research. I
doubt if more can reasonably be expected.

Educational Research: Selected Readings has several other sound
features worthy of note. The book is well organized into five
chapters that follow a logical sequence:

The nature and logical basis of educational research, problem
identification and hypothesis development, design and analysis,
conclusions, and theory development and applicability of findings.

A reasonable balance was achieved in the number of articles
chosen for each chapter. Aside from Chapter 3 (Educational Research:
Design and Analysis), which contains 15 articles, each chapter
consists of seven or eight articles. Gephart and Ingle have written
brief introductions to each chapter. These introduce the topic
covered in the chapter, define its scope, and outline the
intra-chapter organization of articles. Gephart and Ingle have also
written introductions for each reading. These are noteworthy for ...
the main point in the article, and ... of questions for the reader to
bear in mind as he ...
The book is not, of course, without defects. A very few of the
articles, because of the way they are written, the topic they cover,
or the loose way in which they are organized, make for dreary
reading. ... are Cornell's article entitled Productive Methods ...
and Shannon’s on a research classification scheme involving a study
of 1000 educational research reports. An ... is that none of the
articles is directly concerned with measurement in research and the
problem of how to ... measuring instrument. Also there are a ...
introduction to Chapter 4 does not agree with the order that
subsequently appears.

But my discontents are obviously minor ones. As long as
educators are committed to teach courses on educational research,
and as long as they do not delude themselves into thinking the
broad objectives of the courses they teach can be more than
those enunciated earlier in this review, then the volume compiled
by Gephart and Ingle should be seriously considered as a reference
work. It lacks, as most books of readings lack, the completeness
of coverage and the consistency of style and viewpoint desirable
in a textbook. However, as a secondary or supplementary reference,
this book will contribute substantially to a student’s knowledge
of the main problems and issues of the research process in
education.
Representing a collection of eighteen papers which the author
has presented at professional meetings, published subsequently in
the professional literature, or both, Intelligence, Creativity, and
Their Educational Implications brings together nearly two decades of
fundamental research based on the author's structure-of-intellect
(SI) model. From his painstaking search for new intellectual
abilities through use of factor analytic methodology, Guilford has
described not only the theoretical bases of his conceptualization of
intelligence, but also important applications to curriculum
development, to the teaching and learning process, and to creative
endeavor.

The important implications of his work for measurement and 
evaluation in both education and psychology rest not so much on 
advances in statistical methodology, but rather on the formulation 
of a major theory of intelligence and on the dissemination of 
important findings derived from a large number of empirical 
studies that were aimed at testing this theory. Within an edu- 
cational context Guilford emphasizes the importance to learning 
of teaching for both understandings and intellectual conceptuali- 
zations that reflect a number of logical interrelations and organiza- 
tions in the thinking process. He interprets the learning process
as being based on the discovery of functional information rather
than on mere stimulus-response associations and as representing a 
problem-solving approach that involves the training of individuals 
in the use of numerous, diversified mental abilities. Particular 
stress is placed on the identification and measurement of those 
abilities that generate divergent information as well as maximum 
transfer from one learning context to another. Thus this volume 
has important meaning and interest not only to the psychologist 
who studies the fundamental characteristics of intelligence, but 
also to the educational psychologist who wishes to view the appli- 
cations of the SI model within the framework of classroom learn- 
ing in which creative abilities will be nurtured. 

In the first major section concerned with “Components of In- 
telligence” six papers are reproduced. The first two papers present 
the structure-of-intellect (SI) theory in its persistent form, and 
the third paper entitled “Intelligence: 1965 Model” sets forth
Guilford's most recent thinking of intelligence as essentially an 
information-processing model in a problem-solving context. In 
addition three other papers offer useful insights regarding the 
nature of memory, the use of the SI model as a device for de- 
scribing learning as the acquisition of new information, and the 
description of ways in which multivariate methods involving the 
study of individual differences can be applied to studies of 
learning. The last three papers by virtue of their concern with 
the learning process obviously have important educational
implications.

In the second section entitled “Aspects of Creativity” six papers 
are also included. A broad spectrum of creativity is furnished in
that both the nature of creativity including the factors that aid 
and hinder its expression and the relationship of creativity to
... to information theory, to the ...
prise, even though the third section is primarily
concerned with educational ...

... entitled “Educational Implications” ... papers. The first triad sets forth
... tions as conceived within the SI model ...
be used in curriculum development. A
description of which SI abilities may be involved in reading and
in achievement in ninth grade mathematics and of how such
knowledge may be useful to teaching is presented in the second 
and third papers of the triad. In the second triad of papers, great 
stress is placed upon how creativity can be developed and on how 
teaching and examination procedures can be facilitated and im- 
proved from knowledge of the SI model. The last paper will be of 
particular interest to teachers in that it spells out in a step-by-step 
fashion just what the basic principles in teaching for creative 
endeavor are and outlines in a systematic way what important 
constructs in the SI model need to be considered. 

In summary, this short volume of collected papers furnishes a 
comprehensive overview of Guilford’s efforts to understand the 
nature of intelligence and affords a balance in theory and practice 
that should be of interest to learning theorists, to innovators 
of educational curriculum and instruction, and to measurement 
and evaluation specialists. Numerous lessons are implicit in these 
chapters for the person who wishes to find a meaningful frame- 
work for guiding his understanding of how individuals learn, to 
teach for creative endeavor, and to develop improved instruments 
for the evaluation of achievement. This excellent little volume 
should soon find its way to the shelves of both the research 
psychologist and the educational innovator, for it has an im- 
portant message for all those psychologists and teachers who are 
concerned with how students learn in individualistic and creative 
ways. 

Joan J. MICHAEL 

California State College, Long Beach 
WILLIAM B. MICHAEL

University of Southern California 


William A. Mehrens and Irvin J. Lehmann. Standardized Tests 
in Education. New York: Holt, Rinehart and Winston, Inc., 
1969. Pp. xi + 323. $6.95 and $4.95 (paperback). 


Actually the preface of this book could serve well as a review 
of it, for what the reader found promised in the preface really 
was delivered in the text in a clear readable style. That kind of 
writing and summarizing commands respect. 

The content is limited to the things classroom teachers, coun- 
selors, and school administrators must know to select, administer, 
and use standardized tests correctly, with more attention being 
given to cognitive than to noncognitive measures. It contains no 
information about teacher-made tests. It is written for beginning 
students of testing, and as the authors declared in the preface, 
erudite theory is avoided whenever possible. For instance, no 
formal course work in measurement or statistics is necessary to
understand the content of the book; what is needed is given as it
is needed.

The charmingly simple style of the writing in Standardized 
Tests in Education will have great appeal to the student. Of special 
merit is the constant use of practical examples to illustrate theo- 
retical points and the symbol-by-symbol step-by-step revelation 
of the formulae which need to be understood in order to read 
test manuals intelligently. At the end of each chapter is a summary 
of it and a section called “Points to Ponder.” These points, very 
useful study guides, are among the many pieces of evidence that 
identify the authors as experienced and effective teachers. 

The first chapter “Measurement in Education” is the foundation 
for the other four, containing treatment of the basic principles of 
testing as well as the statistical concepts which are needed to
understand reliability and validity. (This reviewer has never seen
a better beginner's explanation of those two concepts.) The three
middle chapters treat aptitude tests, achievement tests, and in- 
terest, personality and attitude inventories. (The authors make a 
real effort to give a clear simple differentiation among these 
types of instruments, and yet they are not at all simplistic.) A 
final chapter “Educational Testing: A Broader Viewpoint” gives 
consideration to testing first as a school wide program, then as 
seen by the parents and the public in general, and finally as a
tool for the future. In the appendix are a useful list of test pub- 
lishers and a glossary of measurement terms. 

If I were choosing a text for a course in standardized testing, I 
would give very serious consideration to Standardized Tests in 
Education. If I were a school administrator, I would order one for
the offices of all those who are involved in the selection of
instruments for the school testing program. If I were studying for
my doctoral comprehensives in measurement, I would memorize
it and go on from there.


Sister Jacinta Mann, S.C. 
Seton Hill College 


... (Eds.). Annual Review of Psychology ... Annual
Reviews, Inc., 1969. Pp. x + 516. ...

The chapters in Volume 20 of the Annual Review of Psychology
of greatest interest to readers of EDUCATIONAL AND
PSYCHOLOGICAL MEASUREMENT include such critical summaries
as “Personality” by Joseph Adelson, “Attitudes and
Opinions” by David O. Sears and Ronald P. Abeles, “Human
Abilities” by Edwin A. Fleishman and C. J. Bartlett,
“Instructional Psychology” by Robert M. Gagné and William D.
Rohwer, Jr., “Personnel Selection” by William A. Owens and
Donald O. Jewell, and “Scaling” by Joseph L. Zinnes.

Among the other selections also likely to be of interest to our 
readers are “Developmental Psychology” by John H. Flavell and
John P. Hill and “Basic Drives” by Byron A. Campbell and James 
R. Misanin. 

Joseph Adelson’s critical summary of recent research in the 
field of “Personality” begins with an emphasis on the diversity of 
research in this area. “The impulse for synthesis, for finding
unities, has for the moment been set aside.” He calls attention
to contemporary criticism of experimental methodology which
challenges the validity of the experimental method itself. Among
other criticisms, mention is made of the questionable ethics of
the use of deception in experimentation and the excessive use of 
undergraduates, volunteers, and captive samples of children. It is 
emphasized that we need a revival of inductive and naturalistic 
approaches. (In this connection see Chapter III of The Logic of 
the Sciences and the Humanities by F. S. C. Northrop.) The
research later summarized is relevant to morality, effectiveness or 
competence, self-esteem and self-acceptance, achievement and per- 
formance, anxiety and aggression, and other aspects of per- 
sonality. 

David O. Sears and Ronald P. Abeles’ summary of recent re- 
search on “Attitudes and Opinions” begins with mention of such 
important sources as several of the chapters in the revised Hand- 
book of Social Psychology, Theories of Cognitive Consistency: 
A Source Book, Advances in Experimental Social Psychology, 
and new editions of Readings in Social Psychology and of Group 
Dynamics. 

The research summarized is relevant to credibility and attri- 
bution theory (under what conditions is a stated opinion con- 
sistent with the speaker’s true attitude?), the relations between 
a communicator’s position and a subject’s original opinion, and 
the various factors in attitude change. These authors agree with
Adelson that there is a loss of confidence in experimentation
with reference to validity, and that there is a need for concern
about deception’s effects on experimental subjects and the
reputation of the experimenter.

Edwin A. Fleishman and C. J. Bartlett’s summary of research
on “Human Abilities” intentionally excludes research relevant
to methodologies, for example, factor analysis and instruments,
in order to concentrate on “only those studies which help
illuminate more general issues of human ability.”

Under the heading “Theoretical and Conceptual Issues” ability 
is first defined and contrasted with skill or achievement. “Abilities 
are seen as representing a class of ‘mediating processes’, identified
through combinations of experimental and correlational research,
in terms of score consistencies among separate performances.”
Mention is made of recent contributions to the understanding of 
the nature of intelligence—those of Guilford, Guttman, Cattell, 
and Humphreys. Cattell’s “Crystallized intelligence” and “Fluid
intelligence” are expertly explained and attention is called to
Humphreys’ comparison of them with hierarchical-group factor
theories of Vernon and Burt. This section concludes with an
interesting effort to relate the contributions of Piaget to those
mentioned above.

Under “Organization of Abilities” a number of studies are cited
relating ability measures to measures of performance. For example, 
Stolurow reviewed evidence that “Programs featuring knowl- 
edge of results, overt responding, and immediate feedback make 
a difference for low ability students, but that high ability students 
do just as well in programs without these features.” Other studies 
cited in this section are concerned with relationships between 
abilities and learning, transfer of training, and changes in factorial
composition of performance variables with practice. Other sections
of this scholarly summary are entitled “Abilities, Task Require- 
ments, and Laboratory Measurement,” “Effects of Environments,” 
“The Effects of Drugs,” “Age and Sex Differences,” and “Cultural 
and Ethnic Differences.” The authors conclude that the “current 
educational and social ferment” has provided “new challenges for 
research on human abilities in the years ahead.” 

“Research in Instructional Psychology” is reviewed by Robert 
M. Gagné and William D. Rohwer, Jr. Introductory paragraphs 
are followed by the main content of the review organized ac- 
cording to Gagné’s “events of instruction” or conditions external 
to the learner which can be manipulated to influence learning. 
Under “Gaining and Maintaining Attention” research relevant
to factors in instruction which bring about ...
attentional sets is reviewed. “The ...
texts to be studied is a ...
experimental problems. . . .” Under ...
the Learner” ...
presentation, and contexts in which the stimuli are presented.
Later sections of the review have the ...
Feedback,” “Promoting Retention,” ...
“Conditions Affecting Transfer” and the very important topic ...


William A. Owens and Donald O. Jewell have summarized the
recent research on Personnel Selection. This review begins with
a thoughtful discussion of various models relevant to personnel
selection. These include the selection-rejection model and the
classification model. The latter is more compatible with the
maximal utilization philosophy. It is stated that multiple-selection
models approach the classification model, which is feasible only for
firms having a great diversity of jobs.

The review continues with discussion of the Fair Employment
philosophy and the problems of bias and invasion of privacy.
Following discussion of research concerning recruitment, a major
part of the review deals with measures and methods useful in
personnel selection. The review concludes with discussion of
such sources as Dunnette’s Personnel Selection and Placement and
with an excellent summary of trends in personnel selection.

Joseph L. Zinnes’ summary of studies of scaling begins by
noting the publication of books by Bock and Jones, by Pfanzagl,
Coombs, Lord and Novick, Lazarsfeld and Henry, and Frederiksen
and Gulliksen. The major sections of the review are “Pair-
Comparison Theories and Experiment”, “Logic of Measure-
ment”, “Conjoint Measurement”, “Multidimensional Models”,
“Power Versus Logarithmic Functions”, and “Detection.”

It is to be hoped that readers of EDUCATIONAL AND
PSYCHOLOGICAL MEASUREMENT will find what is said above a stimulus to
reading the reviews briefly summarized. Volume 20 of the Annual
Review of Psychology as a whole, like its predecessors, is
indispensable in maintaining and broadening professional knowledge
in psychology.


Max D. ENGELHART 
Duke University 


Phillip J. Rulon, David V. Tiedeman, Maurice M. Tatsuoka, 
and Charles R. Langmuir. Multivariate Statistics for Personnel 
Classification. New York: Wiley, 1967. Pp. xi + 406. $12.95. 


For many years the statisticians in the Graduate School of
Education at Harvard University have pioneered in the
development of methods of simple and multiple discriminant analysis. At
long last Rulon, Tiedeman, Tatsuoka, and Langmuir have prepared
... is the systematic development
of methods of discriminant analysis of psychological test data
... of a power test in which the
... level, the topic of dis- ...
to its most profound ...
pages this text generates ...
both the geometric and analytic foundations
... and multiple discriminant analysis as it relates to ...

... tion on personnel classification in relation to the psychological test
profile, Chapters 2, 3, 4, and 5 in the second section are, respectively,
concerned with the one-, two-, three-, and n-variate cases of
the Cartesian geometry of the psychological test profile. Within
this geometric development appropriate matrix notation is intro- 
duced, explained, and illustrated. The individual with limited train- 
ing in matrix algebra and in multivariate statistics cannot expect 
to start reading meaningfully in the middle of the text without 
having studied carefully the previous foundational material. 

The third section of the volume is devoted to one chapter about 
the topology of personnel classification in which the use of centour 
scores is explained as a means of ascertaining the numbers of cor- 
rect and incorrect classifications of individuals into two or more 
criterion groups. The centour score affords a useful index of the 
degree of similarity between the profile of scores of individuals be- 
ing considered for classification within one or more jobs and the 
profiles of existing groups already holding membership in each of 
several jobs. Many useful geometric and numerical charts are pre- 
sented to clarify the use of centour scores which in a two-dimensional
test or discriminant space can be portrayed as a series of
concentric ellipses with a centroid—each family of concentric el- 
lipses embracing linear combinations of predictor scores in a test 
or discriminant space for individuals in a particular job classifica- 
tion. Drawings of overlapping families of ellipses permit estimates 
of the conditional probabilities of being classified in one or more 
job categories as a function of the distance that a person falling 
on a given centour is from the centroid of the particular family of
ellipses relative to a specific job category given that the individual 
is really a member of that job group. 

In the fourth section, three chapters are devoted to the problem 
of reducing the dimensionality of the measures employed. Separate 
chapters on factor analysis, discriminant analysis, and regression 
analysis are presented; the main concepts are effectively interrelated.

In section five Chapter 10 is devoted to combini ...
... essentially binary character of the
criterion variable in ...
within an historical perspective. Basic problems of personnel selec- ...
groups which may be used for problem-solving activities of the
reader. The last three pages of the text contain a bibliography of
42 references pertinent to the classification problem.

The authors have, for the most part, furnished convincing
evidence of the utility of the discriminant analysis approach. They
have almost intentionally overlooked the important work of Cron- ...
... tion imposed by quotas of personnel
categories has not been satisfactorily treated. Perhaps one of the
most dissatisfying aspects of discriminant analysis that the
authors tend to ignore is that the factors or dimensions derived are
not easy to interpret; in fact, the first dimension, which
approximates a general factor, usually accounts for most of the variance.
Rarely are more than two or three factors practical in the
interpretation of the dimensionality of discriminant solutions. What is
very much needed is a series of empirical studies in which Horst’s
multiple differential regression analyses and the multiple
discriminant analyses developed in this book can be compared with
one another, especially in a cross-validation context, so that the
relative merits of each approach can be assessed.
The reviewer recommends most highly this volume to the readers
of EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT. ...
individuals with different methodological biases ...
optimistic claims which the authors have made ...
technique of discriminant analysis, they may find nevertheless
much useful material and may identify certain unresolved issues
which will challenge them to offer alternative methodologies or to
introduce modifications and innovations in discriminant analysis
that will make this approach both more effective and more
meaningful in the classification of personnel.

WILLIAM B. MICHAEL
University of Southern California


Delwyn G. Schubert in consultation with Theodore L. Torgerson.
A Dictionary of Terms and Concepts in Reading. (2nd ed.)
Springfield, Illinois: Charles C Thomas, 1969. Pp. xv + 376.
$7.00.

This quite physically small (5¼" h x 3¾" w x 1¼") tome,
written by an education professor, is intended for the general
student of reading. It contains over 1,950 entries of which
approximately one-third deal with reading as a skill or skill area. The
other two-thirds of the entries are divided among optometric,
measurement, curriculum, medical, and a few psychological terms.

The lexicon could possibly be profitably used in conjunction with 
various undergraduate or beginning graduate courses in the teach- 
ing of reading. However, the text would serve those college
instructors who emphasize the physiological aspects of the reading act
more adequately than those who are more eclectically
inclined.

Information given in the preface claims that the book presents 
the reader with definitions that go beyond those usually tendered 
by providing clarifying examples from time to time. Unfortunately, 
in the reviewer's opinion, most of the definitions are not elaborate
or flexible enough to completely validate this claim. Indeed, it
would appear appropriate for the author to entertain, for a third
edition, a much more comprehensive dictionary replete with
extended definitions and with lists of references citing the source
material for the entries included.


JOSEPH C. JOHNSON II
Duke University 


Julian L. Simon. Basic Research Methods in Social Science: The
Art of Empirical Investigation. New York: Random House,
1969. ...

This book is a good example of trying to please all the people
all the time. Simon's subject is research: philosophy and folklore,
strategies and statistics, for all social scientists. Simon uses a
discursive, casual style of writing (“Now let us talk about numbers.
... suggestion has the unfortunate effect of frightening
... of you nearly out of your wits. Relax; this discussion will ...

... Of course, for what purpose is a
525-page ... of such generality appropriate? It is meant “pri- ...
dents who ha ... done empirical social-scientific
research” (p. 3, ...). ...
the book jacket) clearly consider
the section on “obstacles” ...
and particularly the treatment of statistics to be the high points of
the book. We will concentrate on the sections on causality and
statistics after a brief summary of the whole book.

In Part I Simon introduces such concepts as operational def- 
inition, variables, sampling, ceteris paribus (“all other things 
equal”), and types of research, from “descriptive” to “mapping 
systems.” The section on operational definitions suffers from some 
infelicitous examples—a road map is not an operational definition 
of New York, nor a good definition of how to get to New York— 
which is not surprising in a book so rich in examples. Simon also 
introduces a thing called “systematic random sampling” (see pp.
41; 256-57), an antinomy no less contradictory than the legendary
clean-shaven barber who shaves everyone who doesn’t shave
himself. Generally, the first section is a competently done,
well-illustrated, but superficial introduction to the basic ideas of
empirical research.

In Part II the concept of obstacles to research is introduced:


When I say there are obstacles I mean that the world is big, 
complex, numerous, expensive, and tiring to try to understand, 
not least because human nature is so complex. (p. 77) 


This part is without doubt the most interesting in the book. Simon 
discusses the obstacles of humanness of the observer and the ob- 
served, samples, time and geography, causality, and cost. It is 
filled with examples of amusing, disreputable, and ingenious
research fiascoes from many fields. The examples are vivid enough
to convince even the neophyte of the complexities of research.

Part III, “Decisions and Procedures,” is the inevitable result of
trying to write for everyone from freshmen to professors and
...
informative comments, warnings, or bits of advice scattered about.
In Part III Simon covers procedures for any research study (Ch.
13: Step 1. Ask “What Do I Want to Find Out?”), experiments
and surveys (Ch. 16), and longitudinal vs. cross-sectional studies
(Ch. 19); decisions about variables (Ch. 14), the value of
projected research (Ch. 15), sampling and experimental design
(Ch. 17) and classifying vs. measuring (Ch. 20); and an
introduction to various other topics in Chs. 18 and 21, including
deductive reasoning, the case study, “cleaning” punched cards and
index numbers.

Part IV contains the sections on causality and statistics to be
discussed later.

Part V is a short chapter on Professor Simon's views of what
science is (and isn't); his call for “purposeful research,”
which is meant to go beyond the traditional basic-applied
distinction; and a short discussion of the special nature of social
science research. Kaplan's Conduct of Inquiry (1964) contains
an excellent coverage of these and other topics mentioned briefly
in Simon.

Simon's treatment of statistical methods is unique in textbooks 
in the social sciences. The mean and range are discussed, and 
bivariate data are graphed. However, practically none of the con- 
ventional descriptive statistics are discussed. All inferential sta- 
tistical questions are handled with permutation (randomization) 
techniques. Apparently Simon is attracted by the simplicity and ...

Unfortunately, Simon’s students will be unable to read the
research literature in their fields. Simon confessed in a footnote ...

phrases like “... there are too many
... specific or ... enough,” and “if enough” to make it at all ...

causation when manipulation is possible ...
ment as a paradigm of ...
ent is:

If the response follows the experimental stimulus and if this
experimental relationship persists when other elements of the 
situation are varied, the experimentally observed relationship
may be called “causal.” (p. 441) 


Of course, Simon knows that there are an infinite number of
other elements in the situation to be varied (there are [theoreti-
cally] an infinite number of times of day, for that matter), but 
he gives no indication—because he can’t—of how many or what 
sorts of “other elements” must be manipulated. This problem leaves 
the paradigm sadly circular. 

The culmination of the discussion of causation when manipu- 
lation is not possible is the following definition: 


A statement shall be called “causal” if the relationship is close
enough to be useful or interesting; if it does not require so 
many statements of side conditions as to gut its generality 
and importance; if enough possible third-factor variables have 
been tried to provide some assurance that the relationship is not 
spurious; and if the relationship can be deductively connected 
to a larger body of theory or (less satisfactorily) be supported 
by a set of auxiliary propositions that explain the mechanism by 
which the relationship works. (p. 454) 


The first criterion, that the relationship be close enough to be 
useful or interesting, is doubtful. That the fly-swatter killed the 
fly is certainly a causal statement, though it is neither useful nor 
interesting. There are many good reasons for not writing about 
uninteresting, useless relationships; they may not be “scientific,” 
but they may certainly be causal. Even when a long chain of 
events is interposed between an uninteresting cause and a useless 
effect there is no reason to deny causality. 

The criterion of few side conditions is particularly interesting. 
Certainly, a long list of "ifs" detracts from the usefulness and 
generality of a causal claim, but in many (perhaps most)
experiments, the “ifs” are still there; they are just controlled.

The criticism of the “enough possible third factor variables” 
criterion is exactly the same here as in the experimental case. 

The criterion of deducibility from a larger body of theory is a
very difficult one. Deducing a relationship from phrenology or
Ptolemaic theory would not enhance it. It is perhaps more
important that observations confirm theories than vice versa;
however, anything that makes an event plausible (an explanation, a
theory, or merely having happened before) tends to confirm it.

The final section on causality and antecedence is perhaps only
a matter of opinion or word usage. Simon argues for backward
causation: something now caused by something that won’t happen
until tomorrow. We find this position unsatisfactory. Simon cites
the example of Christmas causing the earlier rise in toy sales. It 
would not be plausible to say that Christmas caused toy-buying 
in November if the world were suddenly destroyed on December 
23: Simon would not, we suppose, advocate causation by a non- 
existent cause. 

Simon’s second example, that of the (future) laying of eggs 
causing a bird to build her nest is difficult for another reason. If 
one accepts reverse-time causation, one has satisfactorily explained 
nest-building. If one sticks to the ordinary temporal order, one is 
compelled to do a great deal of investigation into “instinctual” 
behavior—unless one accepts the equally comforting alternative 
criticised by Simon, that the bird “expects” her young. Thus, we 
feel that a definition of causality without a time-order restriction 
results in too easy explanations. Events which do not fit into the ...

... as well as Kaplan; but he has ...
work with which ... scientists can be trained.
Basic Research Methods ...
an academy. It can give ...
can’t give him the skills ...
sciences.


Nancy W. Burton 
Gene V Glass


Laboratory of Educational Research 
University of Colorado 


James G. Snider and Charles E. Osgood (Eds.). Semantic
Differential Technique: A Sourcebook. Chicago: Aldine, 1969. Pp. ...

... of Meaning by Osgood, Suci and Tannenbaum. The proliferation
of semantic differential studies in the last decade has created a 
need for a general reference work on this topic. The editors'
aim for the present work, as stated in the Preface, is to bring
together “for the first time in one volume as many as possible of
the materials that adequately and accurately illustrate the origins,
history, criticism, methodology, validity, and specific uses of the
semantic differential technique.”


The influence of Osgood’s theory and method upon modern 
psychology has been remarkable. The impact of his work has 
been felt all the way from the ivory towers of pure experimental
psychology, to the consulting rooms of clinical psychology and
the sweat-shops of consumer psychology. How can such a strong
and diversified response be explained? Was it the brilliant insights
of Osgood and his co-workers, or was it the Zeitgeist? The reply,
as usual, must be “both”: the ideas were stimulating and the time
was ripe.

The reviewer has been impressed by two general emphases in 
Osgood's work. The first is the recognition that the “connotative”
or “affective” meanings of stimuli are often of much greater
importance than their formal or “denotative” meanings in the
determination of behavior. As a classroom illustration, the reviewer
has suggested to students that, if they were told that a communist 
was waiting outside to talk to them, their subsequent behavior 
would be more a function of the connotative or affective meanings 
of the term communist (“bad,” “strong,” and “active”) than of 
its denotative or “dictionary” definition. The second major em- 
phasis has been the search for, and demonstration of, a general 
meaning system which transcends the boundaries between con- 
ventional classes of stimuli, and shows promise of by-passing some 
of the particularism which has plagued modern psychology. Thus, 
in the area of attitude research, simple semantic differential Evaluation
ratings seem to get at much the same thing as do the
laboriously derived individual scales of attitude toward this, that, 
and the other. Perhaps in psychology, as in the other sciences, 
the truly important conceptualizations serve to simplify rather 
than complicate. 

Turning to the other side of the coin, it seems clear in retrospect 
that the “spirit of the times” in the late 1950's provided a re- 
ceptivity to the ideas Osgood was proposing. By this time an 
increasing number of psychologists were becoming disenchanted 
with a psychology which spoke only the “language of behavior” 
and were asking, along with Carl Rogers (1964), whether one 
could develop a “new phenomenology” which would combine 
the richness of “subjective” response with the rigor of objective 
demonstration. These psychologists sensed that the hard-won gains
of methodological behaviorism might be employed in an effort 
to understand how persons view the world in which they live. 
The objectivity (and generality) of the semantic differential 
method, and Osgood’s theoretical emphasis upon the phenomeno- 
logical concept of Meaning, seemed to provide just the mix of 
rigor and significance which many psychologists were seeking. 
The result was the outpouring of semantic differential studies from 
which Snider and Osgood have selected the contents of the present 
volume. 

The Sourcebook is organized into ten major parts. Part I includes
three major background selections, the last of which consists of 
excerpts from the Measurement of Meaning. Part II is a somewhat 
novel section consisting of reviews which the Measurement of 
Meaning received following its publication in 1957. The review 
by Weinreich from the journal Word, together with Osgood’s
rejoinder and Weinreich’s reply, anticipates both the influence of
Osgood’s work in the area of linguistics and the controversy which 
continues to swirl around the meaning of the “meaning” which is 
measured by the semantic differential (Kuusinen, 1969; Miron, 
1969; Osgood, 1969). Parts III and IV of the volume are devoted 
to methodological and validity studies of the semantic differential
technique, such as Messick's paper, “Metric Properties of the
Semantic Differential,” Ford and Meisels’ paper, “Social Desirability
and the Semantic Differential,” and Flavell's two papers
on “Meaning and Meaning Similarity.” 

Part V is devoted to a sampling of cross-cultural studies and
includes Osgood’s 1964 paper from the American Anthropologist 
ational research project 
the demonstration of the pan-cultural nature 
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chotherapy, prose style, interpersonal communication in industry, 
and the connotations of psychological journals to their readers. 

Three other characteristics of the volume deserve comment. 
First, there is the forty-page bibliography, containing over 1400
references, which provides for the first time a comprehensive 
listing of the literature on the semantic differential. Second, an 
appendix provides an Atlas which gives connotative meaning
data for 550 concepts. Third, there is the Introduction in which 
Osgood shares his impressions and recollections concerning the 
origin and early development of his interest in meaning and its 
measurement.

The Sourcebook will provide a useful reference work for all
researchers who use the semantic differential. Perhaps even more
important may be the provocative effect of the book upon the
next generation of researchers who may find even more creative
uses for this remarkably versatile technique.


John E. Williams
Wake Forest University 
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With the increased emphasis and effort in educational research, 
at least quantitatively, the need for more competent researchers
has never been more urgent. In this third edition Professor Travers, 
through elimination of some of the technical material of his pre- 
vious editions, has attempted to broaden the appeal of his text to 
include master's degree students. The primary objectives of this
volume remain the same as in the previous editions: the development
of researchers and critical reviewers of research. 

In this new edition the author has rewritten much of the ma- 
terial in previous editions to achieve a simpler presentation. The 
central theme of research based on theory as having most lasting 
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value still remains. New chapters on computers, content analysis, 
and classroom devices for recording and analysis have been added 
supplementing chapters on the main approaches to research in the 
behavioral sciences, i.e., survey, prediction, experimental, historical, 
case study, and environmental. The chapter on suggested pro- 
cedures for conducting classroom research is a welcome addition 
which should be of value even to the novice researcher. As
he should, the author emphasizes the importance of planning for
research and the pitfalls to avoid. These areas are very strong 
aspects of this book. 

Although methodology in educational research has not
changed significantly over the last few decades, the technology
that enables today’s researcher to process the results of his
research faster and with greater accuracy has undergone a
revolution. This volume
presents in well-organized detail material on data handling and
processing. Careful preparation of source documents for machine 
processing is discussed. The importance of this step cannot be 
minimized because of its impact on research results. Travers offers 
a good general description of the concepts underlying computers 
as well as educational applications of computers. The chapters 
discussing these topics should become standard in all research 
books. The computer has been and should continue to be in- 
dispensable for all research involving massive data. 

In his endeavor for broad appeal of this text and possibly 
because of his stress on research concepts, Travers has not heavily 
emphasized strategies or models for experimental research. Even 
though implications for the use of statistical tools are present, the
treatment of both descriptive and inferential statistics is lacking.
If not compensated for with supplementary material, the question
[illegible] in the knowledge and development of researchers.
Even with the availability of computer “statistical packages,”
the use of these tools is a requirement of a competent
researcher. Both master's degree students and doctoral students
should have [illegible]

As in his previous editions, Professor Travers has presented
in a most meaningful way [illegible] concepts of educational research.
His emphasis on theory and basic research should be of immense
value to basic and applied researchers alike. [illegible] has succeeded
in broadening the appeal of his text, [illegible]
students will profit from the use of this [illegible]
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